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1  Introduction 


The  goal  of  the  Research  and  Development  for  Image  Understanding  Systems  (RADIUS) 
project  IS  to  develop  image  understanding  (lU)  algorithms  that  support  model-based  aerial 
photomterpretation.  The  Computer  Vision  Research  Laboratory  at  the  University  of  Mas¬ 
sachusetts  (UMass)  is  funded  under  a  three-year  contract  to  develop  algorithms  for  auto¬ 
mated  geometric  site  modeling.  Delivery  will  be  in  the  form  of  a  set  of  lU  modules,  each 
containing  a  flexible  set  of  programs  for  performing  modeling  tasks.  Although  each  individ¬ 
ual  task  IS  highly  automated,  the  image  analyst  (lA)  maintains  control  of  the  site  modeUng 
session  by  deciding  when  and  where  each  program  is  applied. 

Our  goal  is  to  provide  automated  lU  support  for  the  site  modehng  process.  Given  a 
current  partial  site  model,  and  a  set  of  images  of  the  site,  we  provide  lU  routines  that 
extend  the  model  to  include  previously  unmodeled  site  features  (model  extension),  and 
that  reduce  the  inaccuracies  in  the  existing  model  (model  refinement).  This  process  is 
repeated  as  new  images  become  available,  each  updated  model  becoming  the  current  site 
model  for  the  next  iteration.  Over  time,  the  site  model  will  be  steadily  improved  to  become 
more  complete  and  more  accurate.  Routines  for  model-to-image  registration  support 
these  tasks  by  automatically  determining  the  relative  position  and  orientation  of  each  image 
with  respect  to  the  local  site  coordinate  system.  This  information  is  needed  for  projecting  the 
current  model  onto  the  new  image  for  change  detection,  and  to  determine  what  parts  of  the 
new  image  are  not  currently  covered  in  the  site  model.  The  most  difficult  modeling  task  to 
automate  is  model  acquisition,  the  generation  of  an  initial  site  model  from  scratch.  Early 
versions  of  the  system  will  rely  on  lA  guidance  to  perform  this  important  ta.;k,  although 
work  is  underway  to  more  fully  automate  this  process  as  well. 

This  report  describes  research  activities  performed  during  the  first  year  of  the  UMass 
RADIUS  contract.  October  1992  -  October  1993.  Tie  rett  of  this  sectiL  presents  aS 
overview  of  the  project;  subsequent  sections  are  devoted  to  in-depth  technical  discussions  of 
the  algorithms  that  have  been  developed  so  far. 


1.1  Program  Activities 

UMass  plans  to  deliver  a  set  of  software  modules  that  provide  lU  support  for  site  model 

acquisition,  extension,  and  refinement.  Each  module  is  listed  below,  along  with  a  brief 
description. 

1.  The  Feature  Extraction  module  generates  symbolic  feature  descriptions  of  incoming 
imagery  in  order  to  reduce  the  representational  gap  between  incoming  sensor  data  and  stored 
geometric  site  model  descriptions.  Algorithms  are  currently  available  for  straight  line  segment 
extraction  and  comer  detection, 

2.  The  Model-to-Image  Registration  module  registers  incoming  imagery  with  a  store 
geometric  site  model  by  aligning  their  respective  coordinate  systems.  This  process  allows  a 
site  model  graphic  to  be  overlayed  on  the  image  and  is  the  first  step  in  model  extension  and 
refinement.  Registration  is  achieved  in  two  stages:  3D-to-2D  feature  correspondence  match- 
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ing,  which  determines  the  correspondence  between  site  model  features  and  extracted  image 
features,  and  robust  pose  determination-,  which  uses  these  correspondences  to  determine  the 
precise  position  and  orientation  of  the  camera  in  the  scene. 

3.  The  Model  Extension  module  contains  routines  used  for  site  model  acquisition, 
extension  and  refinement.  Given  a  set  of  interesting  features  seen  in  one  image,  an  epipolar 
feature  matching  procedure  uses  known  image  poses  to  secirch  for  corresponding  geometric 
features  in  other  images  taken  from  different  viewpoints.  Corresponding  features  are  passed 
to  a  multi-image  triangulation  routine  that  determines  the  position  of  the  new  3D  features 
in  the  local  site  coordinate  system.  The  triangulation  routine  can  also  be  used  to  refine 
the  positions  of  old  model  features  based  on  new  observations  detected  via  model-to-image 
registration.  This  module  is  not  yet  fully  developed,  and  may  eventually  contain  a  building 
detection  routine  for  automatically  finding  new  buildings  to  be  added  to  the  site  model. 

4.  The  Vanishing  Point  Analysis  module  can  be  used  to  detect  groups  of  oriented  man¬ 
made  structures  in  the  scene,  to  determine  the  relative  orientation  of  3D  line  structures  from 
a  single  image,  and  to  generate  an  initial  estimate  of  the  rotational  component  of  camera 
pose.  An  algorithm  for  vanishing  point  detection  groups  extracted  straight  line  segments 
into  sets  of  lines  directed  towards  the  dominant  scene  vanishing  points.  Precise  statistical 
estimation  of  vanishing  point  orientation  is  provided  for  determining  the  relative  orientation 
of  lines  in  the  scene  with  respect  to  the  camera,  and  vice  versa. 

5.  The  Image-to-Image  Registration  module  determines  feature  correspondences 
directly  between  two  or  more  images  without  going  through  a  model-to-image  registration 
step.  It  is  thus  useful  when  no  models  zure  yet  available,  as  is  the  case  during  initizil  model 
acquisition.  When  the  pose  of  each  image  is  already  known,  the  epipolar  feature  matching 
zdgorithm  included  in  the  Model  Extension  module  czin  be  used.  When  the  pose  is  not 
known,  epipolar  matching  cannot  be  used  -  for  this  case  we  have  experimented  with  an 
algorithm  for  automated  image  rectification  that  transforms  the  unconstredned  perspective 
image-to-image  feature  matching  problem  into  one  that  can  be  solved  using  simpler  similarity 
matching.  In  the  future,  this  module  will  contziin  a  correlation-based  stereo  algorithm  suitable 
for  generating  dense  digital  terrziin  maps. 

6.  The  Projective  Structure  Recovery  module  is  the  subject  of  future  research  into 
nontraditional  algorithms  for  structure  recovery.  In  particular,  algorithms  for  planar  surface 
recovery  via  projective  invariants  and  SD  stru<^re  recovery  via  invariants  zire  currently  being 
investigated. 

1.2  Schedule 

Below  is  a  three-year  schedule  showing  the  significant  milestones  we  expect  to  achieve  during 
the  course  of  the  RADIUS  contract. 

FY93 

-  Test  model-to-image  correspondence  matching 

-  Test  robust  pose  recovery 

-  Evaluate  vanishing  point  analysis 
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-  Demonstrate  model  extension  (point  features) 

-  Begin  porting  code  to  RCDE 
FY94 

Demonstrate  model  refinement  '  ■  ■ 
Demonstrate  model  extension  (line  features) 

-  Demonstrate  model  extension  (volumetric  models) 

FY95  ’ 

-  Demonstrate  acquisition  of  initial  site  models 
Test  model  reconstruction  based  on  invariance 
Symbolic  extraction  of  building  surface  structure 

-  Quantitive  evaluation  of  all  algorithms 

-  Finish  porting  all  code  to  RCDE 


1.3  Design  Philosophy 

The  UMass  design  philosophy  emphasizes  model-directed  processing  under  a  rigorous  3D 
perspective  camera  model.  Information  from  multiple  images  is  fused  for  increased  TZacy 

a  1  y,  using  a  flexible  set  of  modules  under  image  analyst  control.  These  design 
choices  are  motivated  in  the  paragraphs  below.  ^ 

The  UMass  system  is  designed  around  the  goal  of  buUding,  using  and  maintaining  ge¬ 
ometric  site  models.  This  choice  is  based  on  the  idea  that  model-directed  processing 
provides  a  way  to  perform  basic  image  understanding  tasks  more  efficiently  and  reliably 
than  purely  bottom-up  procedures.  An  example  is  model-to-image  registration  where  3D 
features  in  a  site  model  are  automaticaUy  matched  with  2D  image  features  and  used  to  com- 
pu  e  camera  pose,  thus  avoiding  laborious  hand-selection  of  accurate  control  points  in  the 
images.  The  very  names  of  tasks  like  model  extension  and  model  refinement  were  chosen  to 
emphasize  the  model-based  nature  of  the  UMass  system. 

previous  aerial  image  analysis  systems  that  assumed  nadir  views  only  the 
UMass  system  can  handle  oblique  views  as  well.  This  is  possible  because  the  svstem  is 

to  handle  gener^  viewpoints;  indeed,  many  of  the  algorithms  being  used  in  this  proiert  are 
ported  directly  from  the  Unmanned  Ground-Vehicle  project,  where  they  were  used  in  the 

dSl°to  bravddeT“‘"“‘  r'’?  ‘”^os,  rather  than  being  troublesome 

data  to  be  avt^ed,  are  instead  a  boon  to  aerial  photointerpretation.  Sets  of  oblique  images 

T  viewpoints  tend  to  have  large  camera  baselines  and  vergence 

angles,  which  lead  to  more  accurate  triangulation  results.  ^ 

systems  deal  with  single,  monocular  images  or  a  stereo  pair  of  images  but  the 
UMass  system  is  designed  to  support  multi-image  processing.  This  design  decision  is 
based  on  the  observation  that  incorporating  information  from  many  views  at  each  stage  of 
processing  yields  more  rehable  results.  One  example  is  a  multi-image  trianguiron  routine 

a  vlLTceTatT"'^  M  2  ^ 

batch  style  oror  /  proportional  to  1/n  .  Multi-image  processing  lends  itself  weU  to 

atch-style  processing  of  several  images;  however  it  is  also  possible  to  formulate  recursive 
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versions  that  support  incremental  processing  over  time. 

The  imagery  analyst  plays  a  key  role  in  the  system  being  developed.  Perhaps  the  most 
important  task  for  the  lA  is  determining  what  site  features  are  important  enough  to  be  worth 
adding  to  the  model.  To  this  end,  the  lA  can  guide  the  model  extension  process  by  pointing 
to  or  outlining  areas  of  interest  that  warrant  detailed  photogrammetric  processing.  The  lA 
wiU  also  be  of  invaluable  help  for  tuning  a  small  set  of  operational  peurameters  for  each  new 
domain  of  interest,  and  for  interactively  guiding  the  system  through  the  initizil  stages  of 
acquiring  a  new  site  model.  Future  lU  research  will  focus  on  ways  to  further  automate  the 
site  model  acquisition  process. 

1.4  Report  Overview 

Each  subsequent  section  in  this  document  looks  at  one  of  the  six  modules  listed  in  Section  1.1 
in  more  detail.  Progress  to  date  on  the  module  is  reported,  and  sample  results  are  shown 
where  aveiilable. 


2  Feature  Extraction 

A  site  model  wiU  invariably  be  specified  at  a  much  higher  level  of  geometric  abstraction  than 
e  pixel  intensity  values  in  an  image.  To  Kelp  bridge  the  huge  representational  gap  between 
pixels  and  site  models,  feature  extraction  routines  wiU  be  available  to  produce  symbohc 
geometric  representations  of  potentially  important  image  features.  Many  types  of  geometric 
features  can  be  extracted  from  incoming  site  imagery,  including  straight  hne  segments  Hne 
pencils  (sets  of  hne  segments  that  would  projectively  intersect  at  a  single  point  if  infiiiitely 
ex  ended),  rectihnear  hne  groupings,  curves,  corner  points,  regions  of  homogeneous  intensity 
and  ^extu^red  areas.  Our  current  model  matching  and  extension  algorithms  rely  exclusively 
on  straight  hne  segments  and  corner  features.  Extracted  hne  segments  are  later  grouped 
into  hne  pencils  by  the  vanishing  point  analysis  module  (see  Section  3). 


2.1  Straight  Line  Extraction 


Two  stra,ight  hne  segment  extraction  algorithms  developed  previously  at  UMass  have  been 
ev^uated  on  the  RADIUS  imagery.  The  Burns  algorithm  [11]  begins  by  labehng  pixels 
in  the  intensity  image  according  to  coarsely  quantized  gradient  orientation.  A  connected- 
components  algorithm  then  agpegates  sets  of  adjacent  pixels  with  similar  gradient  orienta- 
lon  into  hne-support  regions,  i.e.  pixel  regions  with  an  intensity  surface  that  supports  the 
presence  of  a  straight  hne  segment.  Symbohc  hne  segments  are  computed  by  intersecting  a 
plane  corresponding  to  the  average  intensity  of  the  hne-support  region  with  a  least-squares 

plane  representing  the  underlying  intensity  surface  in  that  region.  Sample  results  are  shown 
m  Figure  1. 


The  second  hne  extraction  algorithm  to  be  evciluated  was  the  Boldt  algorithm  [lOK  Sam- 
p  e  results  xom  this  algorithm  are  shown  in  Figure  2.  At  the  heart  of  the  Boldt  algorithm 
IS  a  hierarchical  grouping  system  inspired  by  the  Gestalt  laws  of  perceptual  organization 
Zero-crossing  points  of  the  Laplacian  of  the  intensity  image  provide  an  initial  set  of  local 
intensity  edges.  Hierarchical  grouping  then  proceeds  iteratively;  at  each  iteration,  edge  pairs 
are  hnked  and  replaced  by  a  single  longer  edge  if  their  end  points  are  close  and  their  orien¬ 
tation  and  contrast  (difference  in  average  intensity  level  across  the  hne)  are  similar.  Each 
Iteration  results  in  a  set  of  increasingly  longer  Jine  segments. 

The  Burns  hne  extractor  runs  much  faster  than  the  Boldt  extractor  because  it  makes 
fewer  local  decisions  at  each  stage  of  processing  and  maintains  fewer  intermediate  data 
structures  Indeed,  a  stripped  down  version  of  the  Burns  algorithm  has  been  used  as  a  fast 
hne  finder  for  real-time  robot  navigation  experiments  [18].  On  the  other  hand,  the  Boldt  hne 
extractor  generaUy  produces  more  accurate  results  (compare  Figures  1  and  2).  In  particular 
e  urns  algorithm  occasionaUy  produces  hne  segments  which  are  considerably  skewed  in 
orien  a  ion.  This  problem  is  caused  by  oddly-shaped  support  regions  resulting  from  slow 
intensity  gradient  changes  in  the  image.  A  promising  solution  to  this  problem  has  been 
devised  and  is  currently  being  implemented.  For  the  moment,  however,  the  less  accurate  but 
speedy  Burns  algorithm  is  ideal  for  providing  a  “quick  look”  at  new  data  by  generating  a  set 
of  hne  segments  suitable  for  early  stages  of  image  understanding,  such  as  dehneating  man- 
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Figure  1:  Straight  line  segments  extracted  by  the  Burns  algorithm. 


Figure  2;  Straight  line  segments  extracted  by  the  Boldt  algorithm. 


made  objects,  computing  initial  estimates  of  camera  orientation  via  vanishing  point  analysis 
(see  Section  3),  and  even  coarse  model-to-image  registration  (Section  4).  Meanwhile,  the 
more  accurate  Boldt  extractor  is  run  off-line  to  provide  the  highly  accurate  line  features 
needed  for  precise  3D  geometric  analysis  and  model  construction. 

Current  implementations  of  both  algorithms  are  unable  to  handle  the  1300x1000  RADIUS 
modelboard  1  images  in  a  single  chunk.  When  processing  such  large  images,  several  options 
are  available: 

1.  use  a  focus  of  attention  mechanism  to  determine  subwindows  in  which  lines  will  be 
extracted 

2.  break  the  image  into  subchunks  that  are  processing  and  then  pieced  back  together 

3.  reduce  the  size  of  the  entire  image  via  subsampling. 

For  our  experiments  using  the  modelboard  imagery,  a  combination  of  the  2nd  and  3rd  options 
is  used.  First,  image  resolution  is  reduced  by  half  using  Gaussian  filtering  and  subsampling. 
Each  reduced  image  is  then  cut  into  a  mosaiic  of  overlapping  subimages,  each  to  be  processed 
separately  by  the  appropriate  line  extraction  algorithm.  All  line  segments  found  are  trans¬ 
lated  and  scaled  back  into  their  originEil  image  coordinate  system,  then  filtered  so  that  each 
line  segment  in  the  fined  set  has  a  length  of  at  least  10  pixels  long  and  a  contrast  of  at  least 
15  intensity  levels.  This  procedure  produces  roughly  2800  line  segments  per  image.  As  a  side 
benefit  of  this  strategy,  Gaussian  image  reduction  edleviates  the  peculiar  “sawtooth”  noise 
pattern  that  corrupts  severed  of  the  modelboard  1  images. 


2.2  Oriented  Corner  Extraction 

A  new  oriented  corner  extractor  was  developed  to  automatically  and  accurately  extract 
building  corners.  The  corner  extractor  was  designed  to  complement  the  straight  line  segment 
extractors  previously  described.  Initial  design  requirements  for  the  corner  extractor  were  that 
it  should: 

1.  be  able  to  locate  both  dihedred  and  triKedfad  corners, 

2.  not  be  sensitive  to  building  materiad  and  lighting  direction  (this  rules  out  grey-scale 
template  matching), 

3.  not  be  an  excessive  number  of  false  positives  in  textured  areas  and 

4.  not  be  computationally  expensive  to  run,  since  it  will  be  used  over  large  images. 

One  common  approach  to  finding  corners  is  to  seairch  for  two  or  more  line  segments 
with  endpoints  in  proximity,  then  estimating  the  corner  vertex  as  the  ir.  ersection  of  the 
extended  lines.  This  approach  was  rejected  since  we  wanted  an  independent  source  of  feature 
information  that  would  complement  the  extracted  line  segments,  not  share  their  defects. 
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Irien^ri  f  method  was  developed:  potential  corners  are  detected  using  a  set  of 

^tected  ""  image,  then  the  vertex  position  of  each 

T},-  IS  estimated  to  subpixel  precision  using  a  least-squares  fitting  procedure 

This  two-stage  approach  yields  a  good*  balance  of  computation  -  the  imprecise  bu/fast 

squ^Ls  estimaf  '°T  ^  expensive,  but  higHy  accurate,  least- 

squares  estimation  procedure  is  used  to  precisely  locate  each  candidate  corner. 

For  greater  computational  expedience,  it  is  assumed  that  the  expected  orientation  of 
hnes  in  he  image  meeting  to  form  a  corner  is  known.  This  assumption  is  val^d  frc^rLs 
of  buildings  oriented  with  respect  to  three  predominant,  mutually  orthogonal  directions 

the  imre  of?r^  ’'f  fl  I"  ‘^ese  cases,  the  expected  orientation  in 

t^he  image  of  the  mutuaUy  orthogonal  edges  forming  a  dihedral  or  trihedral  corner  can  be 

etermined  automaticaUy  by  vanishing  point  analysis  (see  Section  3).  Although  budding 
edges  with  the  same  3D  orientation  can  project  to  2D  edges  with  diLent  orientations  a! 

fferent  locations  m  the  image,  this  variation  is  predictable  and  often  varies  so  slowly  that 
the  same  oriented  template  is  valid  over  large  portions  of  the  image. 

Intermediate  steps  in  the  oriented  corner  detection  algorithm  are  iUustrated  in  Figure  3 

J  for  each  of  the  three  dominant  orientations  in  the  scene,  an  oriented  Une 

endpoint  template  is  b^t  and  convolved  with  the  Canny  edge  image  (the  result  of  convo¬ 
lution  with  one  oriented  template  is  shown  in  Figure  3c.)  Each  template  is  designed  to  give 
a  strong  positive  response  at  one  end  of  a  chain  of  edges  having  the  correct  orientation  a 

vXe'of  etlTf  ^h f  I*"'  '■"Fonse  everywhere  else.  The  absolute 

value  of  each  of  the  convolution  images  is  then  taken  and  the  three  are  added  together  The 

Xdrll^c  (Fietire  3d)  in  the  vicinity  of  dihedral  and 

trihedral  comers,  for  each  peak  attaining  a  given  threshold,  a  comer  hypothesis  is  formed. 

Choosing  a  locaUy  m^mal  peak  in  the  “comerness”  image  only  localises  the  vertex  of 
a  corner  to  within  a  pmel  at  best,  and  can  be  off  by  two  or  three  pixels  due  to  imprecision 

medsioT  T  5°i  ‘'"-plates.  To  locaHse  the  vertex  position  to  subpixel 

p  ecision,  a  constrained  least-squares  procedure  is  performed.  For  each  hypothesised  ror„.r 

s  ditTd  n  ^  ‘o  determine  whether  the  com« 

“  ;r'  U  r  ’  “  directions  the  contributing  edge!  chains  radiate.  Two 

Ltdb  to  meet  at  a  single  vertex  are  then  lit  simultaneously  to  the 

noted  AhhouehTh  Ut  ”  “  “d  the  position  of  the  best-fit  vertex  is 

noted.  Although  the  least-squares  fitting  procedure  is  moderately  expensive  compared  to  the 

mitiai  comer  detection  phase,  it  is  only  mn  in  places  where  there  is  already  stZrevidence 

at  a  corner  exists.  The  final  result  of  the  comer  feature  extractor  is  a  set  of  symbolic 

ima”r  of  eubpixel  vertex  estimate,  and  the  number  and  orientation  of 

image  edges  meeting  at  that  vertex  (Figure  4). 
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Figure  3.  Illustration  of  oriented  corner  detection,  (a)  Gray-scale  intensity  image,  (b) 
Canny  edge  image,  (c)  Results  of  convolving  the  Canny  edge  image  with  one  oriented 
template  determined  from  vanishing  point  information,  (d)  The  final  comerness  image. 
Local  intensity  peaks  in  this  image  correspond  dihedral  and  trihedral  corner  hypotheses. 
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3  Vanishing  Point  Analysis 

Buildings  in  urban  or  industrial  eireas  eire  often  oriented  with  respect  to  an  underlying  rect¬ 
angular  grid  (city  blocks,  for  example): 'When  viewed  from  the  air,  their  roof  lines  appear  to 
converge  to  two  distinct  vanishing  points  located  on  the  horizon.  For  a  nadir  view,  parallel 
3D  roof  lines  remain  parallel  in  the  image,  and  their  vanishing  points  are  then  said  to  occur 
at  infinity.  A  set  of  lines  that  projectively  meet  at  a  common  point  of  intersection  is  called 
a  line  ■pencil.  Under  known  camera  lens  parameters,  vanishing  point  line  pencils  allow  the 
computation  of  three-dimensional  line  and  plane  orientations  from  a  single  image,  and  thus 
allow  the  orientation  of  the  camera  to  be  determined  with  respect  to  the  underlying  loccii  site 
coordinate  grid.  When  the  camera  lens  parameters  are  not  known,  they  can  be  determined 
to  a  limited  extent  from  vanishing  point  information  [28]. 

A  practical  algorithm  for  estimating  line  orientations  from  vanishing  points  must  address 
two  issues:  how  to  cluster  line  segments  into  pencils  of  lines  passing  through  a  single  van¬ 
ishing  point,  and  how  to  accurately  estimate  3D  line  orientations  from  these  line  pencils. 
Sections  3.1  and  3.2  discuss  these  topics  in  turn.  An  experimented  evaluation  of  the  accuracy 
of  3D  line  orientations  computed  from  vanishing  point  information  is  reported  in  Section  3.3. 


3.1  Vanishing  Point  Detection 

Vamshing  point  detection  involves  finding  clusters  of  lines  directly  towards  a  single  point 
of  intersection.  A  Hough-transform  approach  originally  due  to  Barnard  excels  at  quickly 
clustering  line  segments  into  intersecting  groups  [6].  Line  segments  in  the  image  are  mapped 
onto  a  histogram  representing  the  surface  of  the  unit  sphere  +  z'^  =  1  centered 

about  the  camera  focal  point.  In  practice,  only  the  positive  hemisphere  z  >  0  needs  to  be 
represented,  and  the  surface  of  the  hemisphere  is  partitioned  by  longitude  and  colatitude 
(see  Figure  5).  The  sphere  is  a  more  appropriate  histogram  space  than  the  image  plane  for 
detecting  vanishing  points  because  the  sphere  is  a  compact,  finite  surface,  while  the  image 
plane  is  not. 

Each  line  segment  in  the  image,  taken  together  with  the  camera  focal  point,  forms  a  pro¬ 
jection  plane  which  intersects  the  unit  (hemi)sphere  in  a  great  (semi)circle.  Each  histogram 
bucket  maintains  a  count  of  the  number  o£.great  circles  pzissing  through  it;  potential  van¬ 
ishing  points  are  detected  as  peaks  in  the  histogram,  corresponding  to  cireas  where  severed 
great  circles  intersect.  Barnard  chose  the  center  of  the  histogram  bucket  contedning  a  peak 
as  a  point  estimate  of  the  vanishing  point  location. 

We  have  modified  Barnard’s  basic  algorithm  to  support  statistical  estimation  techniques 
that  more  accurately •  determine  the  true  vanishing  point  location.  The  most  fundamental 
change  is  that  the  histogram  data  structure  is  applied  only  as  an  initial  clustering  method 
and  as  an  efficient  spatial  access  mechanism,  but  the  finad  analysis  of  vanishing  point  lo¬ 
cation  is  performed  on  the  underlying  fine  segment  data.  To  achieve  this,  each  histogram 
t  icket  maintains  a  list  of  the  fine  segments  that  pass  through  it  in  addition  to  the  count. 
After  a  peak  is  detected,  the  line  segments  within  it  are  retrieved  and  statistical  estimation 
techniques  are  applied  to  derive  a  more  accurate  estimate  of  vanishing  point  position  (see 
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cut  here 


Figure  5;  Barnard’s  histogram  method  for  finding  vanishing  points.  A  hemispherical  his¬ 
togram  is  partitioned  by  longitude  and  colatitude.  For  each  line  segment  in  the  image,  a 
great  circle  of  histogram  cells  is  incremented.  Potential  vanishing  points  are  detected  as 
peaks  in  the  histogram,  corresponding  to  areas  where  several  great  circles  intersect. 
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Section  3.2). 

Peak  selection  in  Hough-transform  methods  is  inherently  problematic  [21].  In  particular, 
when  the  true  position  of  a  vanishing  point  on  the  sphere  falls  near  a  histogram  boundary' 
candidate  line  segments  that  should  be  grouped  together  fall  into  separate  buckets.  For  this 
reason  we  actually  collect  line  segments  from  a  whole  neighborhood  of  buckets  surrounding 
the  peak.  We  have  implemented  a  routine  to  visit  every  grid  cell  falling  within  a  circle  of 
given  radius  p  from  a  given  point  on  the  hemisphere,  taking  into  account  the  wraparound 
that  must  occur  due  to  the  mapping  between  unit  hemisphere  and  2D  rectangular  array  of 
grid  cells. 

After  the  Ingest  cluster  of  converging  lines  is  detected  from  the  histogram,  all  lines 
contained  within  that  cluster  are  deleted  from  the  histogram,  and  all  bucket  counts  axe 
updated.  The  highest  remaining  peak  is  taken  as  the  second  cluster,  and  so  on.  Multiple 
vanishing  point  peaks  are  thus  detected  in  decreasing  order  of  number  of  lines  contributing 
to  them.  The  number  of  vanishing  point  clusters  to  look  for  is  a  user-supplied  parameter  - 
usually  less  than  three  appezir  in  any  image. 

An  illustration  of  vanishing  point  detection  on  model  board  imagery  is  shown  in  Fig¬ 
ure  6.  Two  line  pencils  are  detected,  with  roughly  900  line  segments  in  each  pencil.  These 
correspond  to  the  two  dominant  sets  of  horizontal  line  segments  in  the  scene.  Detecting  two 
sets  of  scene  horizontals  via  vanishing  point  analysis,  and  thus  finding  the  horizon  line  of  the 
ground  plane  in  the  image,  has  proven  relatively  easy  in  the  model  board  images.  Detecting 
the  third,  vertical  vanishing  point  is  not  so  easy,  because  the  associated  image  lines  are  fairly 
short  and  tend  to  fall  below  the  line  length  filter  we  apply  in  an  effort  to  keep  the  amount 
of  line  data  more  manageable.  When  needed,  we  compute  the  vertical  vanishing  point  an¬ 
alytically  from  the  two  horizontal  vanishing  point  directions.  The  vertical  vanishing  point 
computed  in  this  way  can  be  used  to  search  for  short  lines  in  the  image  that  correspond  to 
vertical  edges  in  the  scene. 


3.2  Statistical  Estimation  of  Vanishing  Points 

While  Barnard’s  histogram  method  excels  at  quickly  clustering  line  segments  into  convergent 
groups,  a  final  estimate  of  vanishing  point  location  and  variance  should  be  based  on  the 
line  segments  themselves  rather  than  the  arbitrary  bucket  boundaries  of  a  histogram  data 
structure.  We  therefore  use  the  histogram  mechanism  only  for  clustering  and  efficient  spatial 
access,  while  estimating  vanishing  point  location  using  a  geometric  statistical  procedure. 
Because  the  vanishing  point  may  be  at  infinity  in  the  image  plane,  the  quantity  actually 
estimated  is  a  unit  vector  pointing  towairds  the  vanishing  point.  The  orientation  of  this 
unit  vector  is  also  an  estimate  of  the  3D  orientation  of  pairailel  lines  having  that  particular 
vanishing  point. 

Collins  and  Weiss  [12]  present  a  formal  statistical  approach  to  estimating  line  orientations 
from  vanishing  points.  Assuming  that  image  line  segments  have  been  previously  grouped  into 
line  pencils,  the  projection  plane  normals  for  each  pencil  form  a  set  of  points  on  the  sphere 
clustered  about  a  great  circle  perpendicular  to  the  true  vanishing  point  orientation  vector 
(see  Figure  7).  Collins  and  Weiss  treat  this  cluster  as  a  random  sample  from  an  equatorial 
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Figure  6:  An  example  of  vanishing  point  detection,  (a)  Modelboard  1  image  J8.  (b)  Straight 
line  segments  extracted  by  the  Burns  algorithm  (see  Section  2.1).  (c)  and  (d)  The  two  largest 
line  pencils  found  using  the  vanishing  point  clustering  algorithm  described  in  the  text.  These 
line  pencils  correspond  to  sets  of  parallel  3D  line  segments  in  the  scene. 
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Figure  7;  Vanishing  point  estimation  on  the  sphere.  The  projection  plane  normals  of  line 
segments  in  a  vanishing  point  pencil  cluster  around  a  great  circle  on  the  unit  sphere.  The 
poleir  axis  of  the  best-fitting  great  circle  provides  an  estimate  of  vanishing  point  location  in 
the  image,  and  of  the  3D  orientation  of  parallel  lines  associated  with  that  vanishing  point. 


probability  density  function  on  the  sphere,  and  estimate  vanishing  point  orientation  as  the 
polar  axis  of  the  density  function  using  standard  maximum-likelihood  estimation  techniques. 

Although  it  works  quite  well  in  practice,  the  maximum  likelihood  estimator  of  Collins 
and  Weiss  has  one  main  shortcoming.  All  lines  in  a  line  pencil  me  treated  equally,  contrary 
to  practical  experience  which  shows  that  longer  lines  are  detected  more  accurately  than 
shorter  ones,  and  thus  should  be  given  more  credence  when  estimating  line  intersections. 
A  new  statistical  estimator  has  now  been  developed,  based  on  Bayesian  estimation  of  the 
axis  of  a  great  circle  of  noisy  observed  points.  The  unknown  true  values  of  each  point  are 
assumed  to  be  coustraiined  to  lie  exactly  on  a  gieat  circle,  and  each  observation  is  assumed 
to  be  perturbed  by  an  independent  probability  density  function  on  the  sphere  with  a  local 
covariance  computed  from  a  simple  random  perturbation  error  model  on  the  associated  image 
line  segment.  This  hm  the  effect  of  assigning  smaller  variances  to  longer  lines  in  the  image, 
and  hence  assigning  them  a  higher  weight  in  the  resulting  estimation  process. 

3.3  Evaluating  the  Accuracy  of'Orientation  Estimates 

An  experimental  evaluation  of  the  effectiveness  of  the  new  vanishing  point  orientation  es¬ 
timator  was  carried  out  on  images  J1-J8  from  the  modelbomd  1  data  set.  The  goal  of 
this  experiment  was  to  compare  3D  line  orientations  computed  by  the  statistical  estimator 
against  absolute  ground  truth  orientations.  The  ground  truth  orientations  were  provided  by 
Lynn  Quam  at  SRI  Internationcil,  who  computed  them  using  a  multi-image,  block  adjust¬ 
ment  procedure  [3]  where  the  camera  pose  parameters,  effective  camera  focal  length,  and 
principle  point  (image  center)  for  ail  eight  images  were  computed  simultaneously.  It  should 
be  noted  that  the  Hne  orientations  derived  from  Quam’s  comput  -d  camera  pose  values,  are 
themselves,  only  estimates  of  the  true  relative  line  orientations.  However,  since  Quam’s 
computations  are  based  on  precisely  measured  3D  ground  and  2D  image  control  points,  they 


Table  1:  Absolute  orientation  errors  in  estimated  vanishing  point  orientations,  and  in  the 
relative  angle  between  estimated  orientations. 


Image 

Abs  Err  in 
VPl  (deg) 

Abs  Err  in 
VP2  (deg) 

Rel  Ang 
(deg) 

J1 

3.7 

1.0 

91.2 

J2 

3.7 

2.8 

91.7 

J3 

1.5 

1.1 

89.5 

J4 

4.7 

0.5 

87.3 

J5 

1.9 

0.5 

91.1 

J6 

2.4 

2.3 

90.3 

J7 

1.9 

6.8 

87.7 

J8 

2.0 

3.5 

91.9 

avg 

2.7 

2.3 

90.1 

std 

1.1 

2.1 

1.8 

are  expected  to  be  inherently  more  accurate  than  vanishing  point  estimation  methods  that 
are  based  only  on  generic  knowledge  about  parsiilelism  in  the  scene. 

The  experiment  was  carried  out  as  follows.  For  each  of  eight  images  J1-J8,  a  set  of 
line  segments  was  produced  using  the  Burns  straight  line  extraction  algorithm  (Section  2.1). 
This  procedure  produced  roughly  2800  line  segments  per  image.  Baxncurd’s  method  was 
then  used  to  histogram  the  set  of  line  segments  from  each  image,  and  the  two  largest  line 
pencils  were  extracted  (e.g.  Figure  6).  In  aU  cases,  the  two  leirgest  line  pencils  in  each 
image  corresponded  to  the  tv'o  dominant  orthogonal  sets  of  lines  in  the  3D  scene.  Following 
extraction  of  line  pencils,  the  vanishing  point  orientation  estimator  was  applied,  and  the 
resulting  3D  line  orientation  estimates  were  compared  against  the  ground  truth  relative 
orientations  for  north-south  and  east-west  line  segments  in  the  scene. 

Table  1  taUies  the  results  of  this  compairison.  For  each  image,  absolute  errors  between  es¬ 
timated  and  ground  truth  line  orientations  for  the  two  dominant  orthogonal  vanishing  points 
m  the  scene  are  compared,  along  with  the  computed  relative  angle  between  the  estimated 
orientations.  Sample  means  for  the  absolute-errors  in  the  two  vanishing  point  orientations 
are  2.7  and  2.3  degrees.  The  average  relative  angle  between  the  two  estimated  vanishing 
points  orientations,  which  in  reality  is  90  degrees,  was  computed  as  90.1  degrees  with  a 
standard  deviation  of  roughly  2  degrees. 

We  note  in  passing  that  although  the  model  boaird  scenes  are  ideal  for  vanishing  point 
detection,  in  the  sense  that  there  are  many  parallel  lines  in  two  distinct  orientations,  they 
are  among  the  harder  cases  for  accurate  vanishing  point  estimation  since  the  convergence 
angle  of  hne  segments  across  the  image  is  never  very  large.  In  these  images  each  vanishing 
point  is  significantly  off  the  image  plane,  and  in  many  cases,  is  far  off,  near  infinity.  The 
benefits  of  computing  hne  intersections  on  the  sphere  rather  than  the  extended  image  plane 
are  very  apparent  in  these  cases. 
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4  Model-to-Image  Registration 

Matching  and  registration  is  a  ubiquitous  problem  in  computer  vision.  Correspondence 
matching  can  be  broken  into  two  general  areas:  model- to-image  registration  where  corre¬ 
spondences  cire  identified  between  known  3D  model  features  and  their  2D  counterparts  in 
an  image,  from  which  the  absolute  pose  of  the  camera  is  computed,  and  image-to-image 
registration  where  corresponding  features  in  two  images  of  the  same  scene  must  be  deter¬ 
mined,  and  from  them  the  relative  pose  of  the  image  paiir.  Image-to-image  registration  will 
be  covered  in  Section  6. 

Accurate  model-to-image  registration  is  a  necessary  precursor  for  many  site  modeling 
tasks.  Proper  registration  between  an  incoming  image  and  a  stored  geometric  site  model 
determines  the  position  and  appezirance  of  important  model  features  in  the  image.  The 
model  can  then  be  overlaid  on  the  image  to  add  visual  change  detection  and  verification  of 
expected  scene  features.  Model-to-image  registration  using  templates  for  movable  objects 
such  as  trucks  and  railway  cars  provides  a  method  for  locating  and  counting  these  vehicles. 
FinaiUy,  correspondences  determined  by  matching  3D  model  features  with  2D  image  features, 
using  an  initiad  estimate  of  camera  pose,  can  be  used  in  conjunction  with  a  pose  refinement 
algorithm  to  recover  a  more  accurate  estimate  of  camera  pose. 

The  model-to-image  registration  process  involves  two  tasks:  1)  correspondence  matching 
to  determine  correspondences  between  model  features  and  image  features,  aind  2)  camera 
resection  to  determine  the  precise  geometric  relationship  between  the  image  and  the  scene. 
The  first  task  is  by  far  the  hardest,  and  its  success  depends  on  finding  a  good  initial  set 
of  model-to-image  correspondences  automaticcdly.  Whether  a  good  set  of  initial  correspon¬ 
dences  can  be  found  e2isily  or  not  depends  on  the  quality  and  completeness  of  initial  estimates 
of  the  image  acquisition  pmameters.  As  a  general  rule-of-thumb,  the  difficulty  of  finding  cor¬ 
respondences  is  directly  proportional  to  the  number  of  unknown  acquisition  pcirameters  (un- 
constrciined  degrees  of  freedom  in  the  model-to-image  transformation  space)  and  the  amount 
of  uncertainty  in  the  acquisition  pmameters  that  are  known.  Correspondence  matching  is 
discussed  in  Section  4.1. 

The  second  aspect  of  model-to-image  registration  is  camera  resection.  It  is  important 
to  note  that  since  model-to-image  correspondences  are  being  found  automatically  here,  sub¬ 
sequent  camera  resection  routines  need  to  take  into  account  the  possibility  of  mistakes  or 
outliers  in  the  set  of  correspondences  founST  This  implies  that  the  camera  resection  rou¬ 
tines  need  to  use  robust  estimation  procedures  that  aire  impervious  to  the  effects  of  outliers. 
When  the  internal  orientation  or  “lens”  parameters  of  the  camera  are  accurately  known,  and 
only  the  the  externad  orientation  or  “pose”  parameters  need  to  be  precisely  determined,  the 
camera  resection  process  reduces  to  a  pose  estimation  problem.  A  robust  pose  estimation 
procedure  has  been  developed  and  tested  at  UMass,  and  it  will  be  described  in  Section  4.2. 
We  conclude  in  Section  4.3  with  an  experimental  evaluation  of  model  matching  and  pose 
determination  using  a  pair  of  images  from  the  Martin  Marietta,  Denver,  Colorado  site. 
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4.1  Model  Matching 

The  goal  of  model  matching  is  to  find  the  correspondence  between  3D  features  in  a  site  model 
and  2D  features  that  have  been  extracted  from  an  image.  For  example,  we  currently  use 
site  models  that  are  composed  of  wireframe  buildings;  model  matching  in  this  case  involves 
determining  correspondences  between  edges  in  a  3D  building  wireframe  and  2D  extracted  line 
segments  from  the  image.  To  find  this  correspondence,  we  are  evailuating  a  model  matching 
algorithm  developed  at  UMass  by  Ross  Beveridge  [7].  Based  on  a  local  search  approach 
to  combinatorial  optimization,  this  algorithm  solves  for  the  correspondence  between  model 
edges  and  image  line  segments,  and  for  the  transformation  that  brings  the  projected  model 
into  the  best  possible  geometric  alignment  with  the  underlying  image  data. 

The  local  search  matching  algorithm  searches  the  discrete  space  of  correspondence  map¬ 
pings  between  model  and  image  features  for  one  that  minimizes  a  match  error  function.  The 
match  error  depends  upon  the  relative  placement  implied  by  the  correspondence,  and  the 
amount  of  coverage  of  the  model  by  the  data.  That  is, 

■^match  ~  -^fit  -^omission  •  (1) 

In  particular,  fit  error  is  computed  by  choosing  pose  parameters  that  cause  projected  model 
edges  to  appear  most  similar  to  the  image  edges  currently  hypothesized  to  be  in  correspon¬ 
dence  with  them.  The  similarity  between  image  edges  and  projected  model  edges  is  measured 
by  a  least-squares  residual  fit  error.  The  omission  portion  of  the  error  term  is  computed  as 
the  percentage  of  projected  model  lines  having  no  corresponding  data  lines  to  explain  them. 
The  mathematical  trcinsformation  that  maps  model  features  into  the  image  is  essenticJly  a 
module  of  the  system.  Our  current  implementation  handles  the  four  peirameter  2D  simi¬ 
larity  transform  and  the  full  3D  pose  transform.  These  options  cire  described  more  fully  in 
Section  4.1.2. 

To  find  the  optimal  match,  probabilistic  local  search  relies  upon  a  combination  of  iterative 
improvement  and  random  sampling.  Iterative  improvement  refers  to  a  repeated  generate- 
and-test  procedure  by  which  the  algorithm  moves  from  an  initial  match  to  one  that  is  locally 
optimal  via  a  sequence  of  incremental  changes  (addition  or  removal  of  a  model-data  cor¬ 
respondence  pair)  that  continually  reduce  the  match  error.  In  an  effort  to  find  the  global 
optimum,  the  algorithm  is  run  multiple  times  starting  from  different  (random)  initial  cor¬ 
respondences  in  the  model-to-data  line  match- space.  Even  if  the  probability  of  seeing  the 
optimal  match  on  a  single  trial  is  low,  the  probability  of  seeing  the  optimal  match  in  a  large 
number  of  trials,  started  from  uniformly  random  positions  in  the  match  space,  is  high. 

Figure  8  presents  a  pictorial  vignette  of  the  model  matching  process.  Given  an  incoming 
image  (8a),  the  first  step  is  to  extract  a  set  of  2D  symbolic  image  features  (8b)  -  in  this 
case,  straight  line  segments  extracted  by  the  Burns  algorithm  [11].  Also  provided  is  a  3D 
wire-frame  model  (8c)  containing  prominant  site  features  whose  locations  are  known  with 
respect  to  the  local  site  coordinate  system.  This  model  will  be  registered  with  the  image, 
and  in  so  doing,  the  position  and  orientation  of  the  camera  in  the  local  site  coordinate 
system  will  be  determined.  The  3D  model  is  projected  into  the  image  usir^  an  initial,  rough 
estimate  of  the  camera  pose  parameters  (8d),  and  a  set  of  candidate  correspondences  for  each 
projected  wireframe  edge  is  selected  from  the  underlying  image  fine  segments  (8e).  The  model 
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Figure  8:  A  pictorial  summary  ci"  the  model  matching  process,  (a)  Model  board  image  J8. 
(b)  Burns  lines  for  image  J8.  (c)  Partial  wire-frame  model,  (d)  Initicd  model  projection,  (e) 
Candidate  data  correspondences,  (f)  Best  match  of  model  to  data. 
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matcher  then  searches  through  this  space  of  possible  modei-to-image  feature  correspondences 
to  determine  the  best  set  of  matching  features.  The  optimal  correspondence  found  is  used 
to  estimate  the  camera  pose  that  best  brings  projected  model  features  into  alignment  with 
their  corresponding  image  features  (8f). 

The  following  subsections  explore  some  of  the  practical  issues  involved  in  applying  the 
local  search  model  matching  system.  Topics  treated  include  initial  estimation  of  the  image 
acquisition  p^ameters,  selection  of  a  set  of  possible  initial  correspondences,  the  trade-off 
between  efficiency  and  accuracy  when  choosing  the  number  of  degrees  of  freedom  in  the 
model-to-image  transformation  space,  and  use  of  a  hierarchical  matching  strategy  to  further 
reduce  the  combinatorics  of  correspondence  matching. 

4.1.1  Setting  up  the  Match  Space 

Initial  Estimates  of  Acquisition  Parameters 

The  unconstrained  correspondence  matching  problem  is  still  an  unsolved  research  prob¬ 
lem  in  computer  vision.  As  a  practical  matter,  initial  estimates  of  the  image  acquisition  pa¬ 
rameters  need  to  be  available  beforehand  to  cut  the  enormous  number  of  potentiail  matches 
down  to  a  manageable  level. 

We  model  the  image  acquisition  process  as  a  projective  transformation  from  the  3D 
scene  onto  a  2D  image  plane.  Using  homogeneous  coordinates,  this  transformation  can  be 
represented  by  a  3  x  4  matrix  of  parameters,  defined  up  to  a  single  scale  factor.  Thus, 
although  there  are  12  entries  in  the  transformation  matrix,  there  are  only  11  independent 
parameters.  These  parameters  can  be  further  subdivided  into  5  internal  (lens)  parameters, 
and  6  external  (pose)  parameters,  and  may  be  arranged  in  the  following  matrix  formula, 
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which  maps  the  3D  coordinates  s,  y,  and  z  of  a  model  point  into  the  2D  coordinates  u  and  v 
of  an  image  point.  Here,  the  transformation  from  model  to  image  coordinates  is  broken  into 
two  stages,  a  3  x  4  matrix  representing  camera  pose  and  the  process  of  perspective  projection, 
followed  by  application  of  a  3  x  3  matrix  of  camera  lens  parameters.  The  pose  parameter 
matrix  is  further  partitioned  into  a  3  x  3  orthonormal  rotation  matrix  R  (containing  3 
degrees  of  freedom)  and  a  3  x  1  translation  vector  t  (containing  the  remaining  3  degrees  of 
pose  freedom).  The  5  camera  lens  parameters  are  represented  by  and  s„,  which  are  the 
focal  lengths  in  pixels  along  each  of  the  image  axes,  uo  and  vq,  which  are  the  pixel  coordinates 
of  the  camera  principle  point,  and  an  image  axis  skew  parameter  a,  which  is  always  assumed 
to  be  0  unless  information  to  the  contrary  exists. 

For  our  experiments,  initial  estimates  of  the  camera  lens  and  pose  parameters  for  the 
RADIUS  model  board  imagery  were  determined  as  follows.  First,  nominal  values  for  the 
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internal  camera  pairameters  were  filled-in  from  information  supplied  with  the  model  board 
data  (namely  6.8  micron  square  pixels  and  a  foccil  length  of  47.9  mm  for  the  18  inch  GSD 
images)  and  by  assuming  the  principle  point  to  be  in  the  numeric  center  of  the  image.  To 
estimate  the  externcil  camera  parameters-,  the  orientation  of  the  camera  with  respect  to  the 
scene  was  first  determined  by  vanishing  point  analysis  (see  Section  3),  up  to  a  four-fold 
ambiguity  that  was  resolved  by  identifying  the  direction  of  true  north  in  the  image  by  hand. 
The  distance  of  the  Ccimera  from  the  ground  was  computed  from  the  reported  Ground  Scale 
Distance  (GSD);  to  date  our  experiments  have  only  used  the  18  inch  GSD  images.  Finally, 
the  intersection  of  the  camera’s  line  of  sight  with  the  ground  plane  was  estimated  manually. 

Selecting  the  Correspondence  Space 

Model  matching  performs  a  sezu-ch  through  the  space  of  possible  model  to  data  corre¬ 
spondences.  This  space  is  initially  set  up  by  deciding  which  data  lines  in  the  image  are 
to  be  considered  as  candidate  matches  for  each  model  line.  Ccireful  pruning  of  this  space 
at  the  start  is  crucial  to  achieving  tractable  run  times.  The  problem  is  greatly  simplified 
when  good  initial  estimates  of  the  image  acquisition  pzirameters  are  available.  The  better 
the  initial  estimates,  the  tighter  the  filters  for  picking  out  possible  candidate  data  lines  can 
be,  both  in  terms  of  orientation,  position  in  the  image,  and  length. 

Even  though  the  metric  used  to  score  potentizil  correspondences  is  purely  geometric 
(Equation  1),  photometric  expectations  such  as  the  sign  zind  magnitude  of  contrast  across 
a  line  can  be  enforced  by  prefiltering  for  these  properties  in  the  initial  candidate  generation 
phase.  Our  tendency  has  been  to  underspecify  rather  than  overspecify  filter  parameters, 
however,  because  once  a  correct  line  pairing  has  been  excluded  by  overzealous  filtering  in 
the  candidate  generation  stage,  that  correct  pziiring  can  never  contribute  to  the  match  that 
is  eventuzJly  found. 

For  the  experiments  we  have  run  on  the  model  board  imagery,  we  project  each  model  line 
into  the  image  using  the  initizil  estimate  of  image  acquisition  parameters  described  above. 
AU  image  line  segments  having  an  orientation  within  10  degrees  of  the  projected  model  line 
orientation,  and  located  within  100  pixels  laterally  from  it,  are  then  selected  as  possible 
matching  candidates. 

4.1.2  Model-to-Image  Transforms 

Essentially  all  model-to-image  correspondence  problems  involve  solving  for  a  discrete  corre¬ 
spondence  between  model  and  image  features  along  with  an  associated  transformation  map¬ 
ping  model  features  into  the  image.  The  two  problems  together  constitute  model  matching: 
a  match  being  a  correspondence  plus  a  transformation.  The  most  general  transformation 
typically  considered  involves  full  3D  pose:  a  rigid  3D  model  is  rotated  and  translated  relative 
to  the  camera  and  then  projected  into  the  image  using  a  known  camera  model.  This  amounts 
to  solving  for  the  pose  parameters  in  Equation  2,  while  holding  the  lens  parameters  fixed. 

Unfortunately,  this  approach  to  matching  is  very  expensive  computationally.  To  see 
why,  consider  the  match  error  metric  introduced  in  Equation  1,  which  is  a  combination  of 
geometric  fit  and  omission  terms.  To  compute  these  terms  for  a  particular  set  of  model- 
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to-data  correspondence  pairs  involves  solving  for  the  transformation  that  brings  proiected 
model  features  into  best  ahgnment -with  their  hypothesized  corresponding  data  lines,  where 
est  is  defined  as  minimizing  the  residual  least-squares  geometric  fit  error.  Thus,  to  eUluate 
each  state  in  the  space  of  potential  correspondence  matches,  a  full  3D-to-2D  pose  solution 
must  be  computed.  There  is  no  known  closed-form  solution  to  the  pose  problem  for  an 
arbitrary  number  of  points  and  Hnes  -  exact  solutions  require  iterative,  nonlinear  least- 
squares  tec  niques.  In  a  typical  run,  the  model  matcher  searches  through  thousands  of 
possible  match  states  in  its  search  for  the  optimum  correspondence.  If  at  each  match  state 
one  needs  to  solve  the  fuU  3D-to-2D  pose  problem,  the  resulting  algorithm  is  prohibitively 


f  llInT  advantage,  then,  in  looking  for  alternative  methods  to  replace  solving  the 

lull  3D-to:2D  pose  equations  exactly  at  each  step.  We  have  investigated  two  such  approaches 
both  originally  introduced  by  Beveridge  [8].  These  methods  are  four-parameter  2D-to-2D 
atfane  (similarity)  matching,  and  hybrid  3D-to-2D  matching. 

2D-to-2D  Matching 

The  idea  behind  2D-to-2D  matching  approximations  to  the  full  3D-to-2D  correspondence 
problem,  is  that  given  good  initial  estimates  of  the  lens  and  pose  parameters  of  the  camera 
the  3D  model  can  be  projected  onto  the  image  before  matching  begins,  turning  the  problem 
into  a  search  for  the  2D  transformation  that  best  brings  the  2D  projected  model  features 
into  correspondence  with  the  image  data  features.  This  is  the  underlying  motivation  for  the 
2D  similarity  version  of  the  model  matcher,  which  seeks  the  best  four  parameter  affine  or 
sirmlartty  transforin  relating  a  projected  set  of  wireframe  model  edges  with  a  set  of  image 
data  hne  segments  [7].  ° 

The  2D  similarity  transform  is  a  planar  transformation  consisting  of  four  parameters- 
a  rotation  angle  in  the  plane,  a  2  parameter  translation  in  the  plane,  and  a  single  global 
chanp  of  sc^e.  It  IS  a  subgroup  of  the  more  general  6-parameter  affine  transformation  group. 

nding  the  best  2D  similarity  transform  between  a  set  of  corresponding  model  and  data  Hnes 
1^5  much  faster  than  solving  for  full  3D  pose.  Indeed,  Beveridge  devised  a  closed-form  solution 
for  determining  the  similarity  transform  that  brings  a  set  of  2D  model  Hnes  into  aHgnment 
with  their  corresponding  data  Hnes.  Because  the  similarity  transform  matcher  is  fast  it  is 

^  increasing  the  confidence  in 

nncling  the  best  correspondence. 

Problems  with  2D-to-2D  Matching 

fT  'a?  experiments  using  the  2D  similarity  matcher  on  images 

J1  J8  of  the  Model  Board  1  data  set,  using  a  simple  wireframe  model  generated  from  the 
dU  ground  truth  data  points  provided.  Initial  estimates  of  the  camera  pose  parameters  were 
Z  P^^'^io^sly  described,  then  purturbed  along  their  various  degrees  of  freedom 

by  hand.  Our  experiments  showed  that  the  performance  of  the  2D-to-2D  similarity  version 
o  he  matcher  was  sensitive  to  the  accuracy  of  the  initial  camera  pose.  This  was  not  entirely 
unexpected  -  a  good  initial  pose  estimate  is  crucial  for  2D-to-2D  matching  since  the  initial 
pose  determines  what  2D  projected  model  is  matched  against  the  data.  Once  it  has  been 
projected  using  the  initial  pose,  the  resulting  2D  model  can  thereafter  only  be  rotated 
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translated  and  scaled  in  2D  in  an  effort  to  fit  it  properly  to  the  image.  We  had,  however, 
expected  this  version  of  the  matcher  to  work  relatively  well  in  the  aerial  domain,  where  the 
camera  is  far  from  the  objects  being  viewed,  as  opposed  to  “close-range”  domeiins,  such  as 
hallway  navigation,  where  small  differences  in  viewpoint  can  cause  large  changes  in  object 
appezirance. 

Further  study  revezded  that  the  2D  similarity  matcher  was  particularly  sensitive  to  the 
accuracy  of  the  initial  estimate  for  the  camera  look  vector.  The  reason  for  this  can  be 
explciined  by  considering  the  relationship  between  similarity  transformations  of  a  2D  model 
in  the  image  and  the  effects  of  the  six  degrees  of  freedom  of  camera  pose  on  the  initicd 
projection  of  the  model.  Errors  in  the  initizd  camera  pose  estimates  lead  to  projected  model 
templates  that  deviate  from  their  ideal  2D  location,  orientation,  size,  and  shape  in  the  image. 
The  four  parameter  similarity  transformation  can  correct  for  errors  in  2D  image  location 
caused  by  incorrect  specification  of  the  horizontal  location  (e.g.  latitude  and  longitude)  of 
the  camera  above  the  ezirth,  errors  in  image  rotation  due  to  misspecification  of  the  amount 
of  rotation  of  the  camera  about  the  line  of  sight,  and  global  image  scale  changes  caused 
by  incorrect  specification  of  the  distance  of  the  camera  from  the  scene.  The  2D  similzirity 
transform  cannot  recover  from  errors  in  image  object  shape,  however.  These  shape  errors 
are  due  to  incorrect  estimates  of  the  direction  and  amount  of  object  foreshortening,  and  are 
a  direct  result  of  errors  in  the  initial  orientation  estimate  of  the  camera  look  vector. 

Figure  9  illustrates  an  example  of  the  problem  that  can  occur  due  to  incorrect  estimates 
of  foreshortening.  Figure  9a  shows  an  initial  projected  model  template  overlayed  on  one 
of  the  model  board  images.  The  most  notable  discrepancy  at  this  stage  is  a  large  error 
in  translation  of  the  2D  model  template  with  respect  to  its  correct  position  in  the  image. 
Figure  9b  shows  the  results  of  applying  the  best  four  parameter  similarity  mapping  found  by 
the  2D  local  search  matcher,  and  Figures  9c  and  9d  are  zoomed  images  showing  the  match  in 
greater  detail.  The  overlap  between  the  2D  model  template  and  the  image  has  been  greatly 
improved  -  a  four  paurameter  similarity  transformation  has  been  found  that  accurately  ahgns 
the  model  template  on  three  out  of  four  sides  of  the  main  building.  However  the  fourth 
side,  shown  in  detail  in  Figure  9d,  is  still  considerably  misaligned.  It  has  to  be  stressed 
that  this  is  not  a  failure  of  the  2D  matching  algorithm,  per  se.  It  has  found  the  best 
possible  match  zdlowed  under  these  circumstances.  The  problem  is  that  no  combination 
of  2D  rotation,  translation  and  uniform  sczding  can  possibly  bring  the  initial  projected  2D 
model  into  alignment  with  the  underlying  inmge,  because  of  the  excessive  foreshortening  in 
one  direction  of  the  model  template  due  to  an  inaccurate  initizd  estimate  of  the  camera  look 
vector. 

Hybrid  3D-to-2D  Matching 

We  have  observed  the  foreshortening  problem  described  above  in  many  of  our  2D-to- 
2D  matching  experiments  on  the  model  bozird  images.  This  has  led  us  to  abandon  the  2D 
similarity  matching  system  in  favor  of  a  hybrid  3D-to-2D  perspective  matching  system  also 
developed  by  Beveridge  [9].  This  system  combines  the  speed  of  2D  similzirity  matching  with 
the  accuracy  of  full  3D  pose  matching.  In  particular,  the  hybrid  matcher  interleaves  2D 
similarity  matching  with  3D  pose  updates.  Stairting  with  an  initial  projection  of  the  3D 
model  to  form  a  2D  model  template,  a  full  hiU-climbing  cycle  of  the  2D  similarity  matching 
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gure  .  o  el  matching  using  the  2D  similarity  transform,  iUustrating  the  effects  of  fore- 

projection  of  model  template,  (b)  Overlay  after  2D  similarity 
matching,  (c)  Close-up  showing  good  aUgnment  on  three  sides  of  the  model  (d^ 
Close-up  showing  poor  alignment  on  the  fourth  side,  due  to  uncompensated  foreshortening 
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system  is  performed  from  a  random  start  in  correspondence  space.  Once  a  local  optimum 
is  reached,  the  final  correspondence- between  3D  model  lines  and  2D  image  line  segments 
is  passed  to  a  full  3D  pose  determination  algorithm  (see  Section  4.2).  The  computed  pose 
provides  a  new  estimate  of  camera  location  and  orientation.  This  is  used  to  reproject  the  3D 
model  to  create  a  new  2D  model  template,  and  2D  local  scEirch  similarity  matching  begins 
anew.  The  process  converges  when  no  change  to  the  set  of  model-to-image  correspondences 
Ccin  be  found  that  improves  the  match  score. 

The  crucial  difference  that  allows  the  3D-to-2D  hybrid  matcher  to  work  in  cases  where 
the  2D-to-2D  matcher  feuls  is  the  3D  pose  estimation  and  model  reprojection  step  occuring 
after  each  2D  similarity  matching  episode.  This  series  of  pose  updates  incrementally  corrects 
for  any  foreshortening  due  to  incorrect  initial  estimates  of  camera  pose.  On  the  other  hand, 
most  of  the  matcher’s  computation  time  is  spent  evaluating  the  effects  of  adding  or  removing 
p2Lirs  of  potenticil  correspondences,  and  in  the  hybrid  matcher,  this  work  is  being  done  using 
the  simpler  2D  similarity  transformation  equations,  for  which  a  fast  closed-form  solution 
exists.  The  resulting  hybrid  system  thus  offers  a  clever  blend  of  speed  and  accuracy. 

Figure  10  illustrates  the  results  of  the  hybrid  3D-to-2D  matcher  on  the  same  matching 
problem  as  shown  in  Figure  9.  Figure  10a  shows  a  zoomed  in  portion  of  the  initial  projection 
of  the  3D  site  model  overlayed  onto  the  image,  and  Figure  10b  shows  agciin  the  results  of  the 
best  match  found  by  the  2D-to-2D  simileirity  matching  system.  The  results  of  the  hybrid 
3D-to-2D  matching  system  are  shown  in  Figures  10c  and  lOd.  These  figures  show  in  turn  the 
results  of  matching  only  rooftop  lines,  and  then  all  wireframe  model  lines,  using  a  hierarchical 
matching  strategy  that  will  be  outlined  below.  The  overlap  between  the  projected  model 
and  the  2D  image  is  now  very  good,  and  all  vestiges  of  foreshortening  errors  caused  by  the 
incorrect  initial  estimate  of  the  camera  look  vector  have  been  removed. 


4.1.3  Hierarchical  Matching  Strategy 

Based  on  experimentation  with  the  model  boeird  imagery,  further  steps  have  been  taken  to  cut 
down  on  the  combinatorics  of  the  matching  problem.  The  most  relevant  of  these  innovations 
is  to  use  a  hiercirchical  matching  strategy.  The  run  time  complexity  of  correspondence 
matching  in  each  individuail  image  is  governed  by  the  number  of  model  line  to  data  line 
candidate  pairs  that  are  considered.  The  number  of  correspondence  pairs  that  are  considered 
is  directly  related  to  parameters  governing^  the  size  of  a  search  window  airound  each  model 
line,  and  in  turn,  the  size  of  this  search  window  should  be  set  according  to  the  uncertainty  in 
the  user’s  current  knowledge  about  the  pose  of  the  camera.  A  two  stage  hierarchical  strategy 
is  currently  employed.  In  the  initizil  stage  when  pose  uncertednty  is  large,  only  horizontal 
model  lines  from  building  roofs  are  considered.  There  are  several  heuristics  behind  this 
choice:  roof  lines  comprise  roughly  a  third  of  the  total  number  of  lines  in  the  full  model, 
they  are  relatively  long  (as  compared  to  the  building  verticals),  and  they  are  less  likely  to 
be  occluded  (as  compared  to  the  horizontal  ground  lines). 

Once  the  best  corresnondence  and  pose  are  found  using  only  roof  lines,  the  second  stage  of 
matching  begins.  The  full  set  of  model  lines  is  projected  into  the  image  using  the  computeu 
pose,  and  hidden  lines  are  removed.  Matching  resumes  using  this  more  numerous  set  of  lines. 
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igure  10.  Model  matching  using  the  hybrid  3D-to-2D  matcher,  (a)  Close-up  of  initial 

effZTTf  rt  2D  similarity  model  matching  showing 

effects  of  foreshortening,  (c)  Close-up  of  results  for  3D-to-2D  matching  of  roof  lines  Note 

3D  to  2D  matT®  (d)  Close-up  showing  alignment  after 

3D-to-2D  matching  using  the  full  wireframe  model. 
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but  with  a  much  tighter  bound  on  secirch  window  size,  since  the  projection  of  the  model  into 
the  image  using  the  pose  computed  from  the  first  stage  should  be  fairly  accurate.  The  goad  is 
to  keep  a  near-constant  upper  bound  on  the  number  of  model-to-data  correspondence  pairs 
considered  at  each  stage,  thereby  keeping  total  match  time  roughly  constant.  During  the 
first  stage  there  are  fewer  model  lines,  but  many  more  data  candidates  to  consider;  during 
the  second  stage  many  more  model  lines  are  used,  but  each  has  only  a  small  number  of  data 
candidates  that  could  possibly  be  associated  with  it. 


4.2  Camera  Resection 

The  result  of  model  matching  is  a  set  of  model-to-image  feature  correspondences  between  3D 
wire-frame  edges  and  2D  image  line  segments.  The  next  stage  in  the  model  extension  process 
uses  these  correspondences  to  resect  more  accurate  estimates  of  camera  pose.  These  updated 
parameter  estimates  cire  then  used  to  triangulate  the  positions  of  new  scene  features.  Since 
automatically  generated  feature  correspondences  may  contziin  gross  errors,  called  outliers,  it 
is  important  for  any  subsequent  camera  resection  procedure  to  use  robust  statistical  meth¬ 
ods  when  computing  new  camera  parameters.  A  robust  cilgorithm  for  finding  3D  camera 
pose  from  a  set  of  model-to-image  point  and  line-based  feature  correspondences  has  been 
developed  at  UMass  in  a  Ph.D.  thesis  by  Kumar  [24,  25,  26]. 

At  the  heart  of  Kumar’s  robust  pose  code  is  an  iterative,  weighted  least-squcires  algorithm 
for  computing  pose  from  a  set  of  correspondences  that  are  free  from  outliers.  The  pose  pa¬ 
rameters  are  found  by  minimizing  an  objective  function  that  measures  how  closely  projected 
model  features  fall  to  their  corresponding  image  features.  For  each  point  correspondence 
the  objective  function  is  incremented  by  the  squared  residual  distance  from  the  image  point 
to  its  corresponding  projected  model  point,  weighted  by  the  covariance  of  the  extracted  im¬ 
age  point.  For  each  line  correspondence,  the  objective  function  is  incremented  by  the  sum 
of  squares  of  perpendicular  residual  distances  from  the  endpoints  of  the  image  line  to  the 
projection  of  its  corresponding,  infinitely-extended  model  line,  weighted  by  the  expected  co- 
variance  in  the  measured  image  line  endpoints.  Although  systems  in  the  past  have  proposed 
similar  objective  functions,  Kumar’s  method  of  solving  for  the  pose  parameters  differs  from 
previous  approaches  in  two  significant  ways.  First,  both  rotation  and  translation  parameters 
are  solved  simultaneously,  which  mzikes  more  effective  use  of  the  geometric  constraints  and 
is  more  accurate  in  the  presence  of  noise  tS&h  techniques  that  decompose  the  problem  by 
solving  for  rotation  first,  followed  by  translation.  Second,  the  nonlinear  least-squares  op¬ 
timization  algorithm  used  to  solve  for  rotation  and  translation  is  based  on  the  quaternion 
representation  of  rotations,  which  provides  much  better  convergence  properties  than  solution 
methods  based  on  Euler  angles.  The  results  of  this  basic  pose  algorithm  are  a  set  of  pose 
parameters  that  minimize  the  objective  function  and  a  covariance  matrix  that  estimates  the 
accuracy  of  the  solution. 

It  is  well  known  that  least  squares  optimization  techniques  can  fail  catastrophically  when 
outliers  are  present  in  the  data.  For  this  reason,  Kumar  embedded  the  basic  pose  algorithm 
described  above  inside  a  leaso  median  squares  (LMS)  procedure  that  repeatedly  samples 
subsets  of  correspondences  to  find  one  devoid  of  outliers.  This  approach  is  called  least 
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median  squares  because  it  in  effect  minimizes  the  median- squared  residucii  distance  error 

rather  than  the  mean-squared  distance.  LMS  is  robust  over  data  sets  containing  up  to  50% 
outliers. 

The  LMS  algorithm  works  by  repeatedly  choosing  subsets  of  a  given  size  K  from  the 
M  set  of  available  correspondences,  and  computing  their  implied  pose  using  the  basic  pose 
algorithm.  In  experiments,  a  subset  size  of  K=6  has  worked  well  [25].  For  each  subset, 
the  residual  squared  distance  term  associated  with  each  correspondence  in  the  set  is  ranked 
according  to  magnitude,  and  the  median  residual  is  noted.  After  all  subsets  of  size  K  are 
processed,  the  one  yielding  the  smallest  median  residual  error  is  selected  as  being  outlier- 
free.  Since  processing  all  subsets  of  size  K  can  be  computationally  expensive  when  there 
are  a  large  number  of  correspondences,  in  practice,  just  a  random  sample  of  these  subsets 
IS  generated,  where  the  number  of  random  subsets  that  need  to  be  considered  is  determined 

on  probabilistic  grounds  based  on  an  estimate  of  the  probable  number  of  outliers  present  in 
the  data. 

After  finding  a  (presumably)  outlier-free  subset,  its  computed  pose  could  be  used  as  a 
robust  pose  estimate  of  the  full  set  of  correspondences.  However,  one  fined  stage  of  pro¬ 
cessing  helps  to  improve  the  statistical  properties  of  the  pose  solution.  First,  the  full  set  of 
original  correspondences  is  examined  using  the  pose  computed  for  the  best  subset,  and  cor¬ 
respondence  pairs  are  removed  that  have  a  large  residual  error  as  compared  with  a  threshold 
etermined  by  robust  estimation  of  the  mean  and  variance  of  the  residuad  errors.  Then  the 
weighted  least-squares  pose  solution  is  run  on  the  remaining  correspondences,  which  are  as¬ 
sumed  to  be  outlier-free.  This  last  computed  pose  is  returned  as  the  final  pose  estimate.  The 
reason  this  last  stage  is  performed  is  to  increase  the  relative  statistical  efficiency  (decrease 
the  variance)  of  the  final  estimated  pose  parameters. 

Based ^n  a  model  of  image  noise  and  the  assumption  that  the  3D  model  data  is  accurate 
dosed  form  expressions  for  the  uncertainty  in  the  pose  estimation  results  have  been  derived.’ 
Kumar  has  shown  analytically  that  the  error  in  the  output  pose  parameters  is  linearly  related 
to  the  noise  in  the  input  feature  data  [25].  He  also  studied  the  effect  of  errors  in  estimates 
of  the  camera  principle  point  and  focal  length  on  the  resulting  pose,  showing  that  incorrect 
knowledge  of  the  principle  point  does  not  significantly  affect  the  computed  3D  location  of 
the  sensor  (although  the  computed  rotation  is  affected),  and  that  incorrect  estimation  of  the 
camera  focal  length  only  significantly  affects  the  estimate  of  camera  distance  from  the  scene. 

For  images  where  internal  camera  parameters  are  not  available,  or  not  known  very  accu¬ 
rately,  the  pose  determination  process  could  conceivably  be  extended  to  solve  for  both  lens 
(internal  orientation)  and  pose  (external  orientation)  parameters.  The  resulting  highly  non- 
near  set  of  equations  could  best  be  solved  if  multiple  images  taken  with  the  same  camera 
were  available,  in  which  case,  a  joint  optimization  procedure  could  be  used  to  determine  the 
single  set  of  lens  parameters  at  the  same  time  that  the  different  pose  parameters  for  each 
view  were  being  computed.  We  are  not  actively  investigating  this  approach  since  we  have 
een  assured  that  a  weU-caHbrated  set  of  lens  parameters  wiU  always  be  known  in  advance. 
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4.3  Evaluation  of  Matching  and  Pose 

The  Mcirtin  Marietta,  Denver  Colorado  stereo  pair  distributed  by  TEC  provided  a  nearly 
ideal!  set  of  data  for  quantitatively  evaluating  the  procedurail  chain  of  line  extraction,  model 
matching,  and  pose  determination.  Previous  evaluation  of  the  accuracy  of  pose  determination 
using  the  model  boaird  imagery  had  been  hampered  by  the  lack  of  good  ground  truth  pose 
measurements.  In  contrast,  very  good  ground  truth  camera  poses  were  provided  with  the 
Denver  stereo  pair.  The  only  drawback  is  that  no  ground  truth  model  of  the  building 
was  provided,  and  no  3D  control  points  were  available  from  which  such  a  model  could  be 
constructed.  Since  model  matching  and  pose  determination  reqiiire  an  initieil  3D  model,  one 
was  built  by  hand  from  the  stereo  pciir  by  hand-matching  distinctive  corner  points  between 
the  two  images  and  then  triangulating  to  derive  their  3D  positions.  From  these  points,  an 
initial  building  model  composed  of  3D  line  segments  was  created. 

The  camera  model  provided  by  TEC  with  the  Denver  images  was  more  elaborate  than 
the  one  used  by  our  model  matching  and  pose  determination  algorithms.  This  prompted 
us  to  update  our  camera  model  to  a  pinhole  perspective  projection  followed  by  a  full  six 
pcirameter  affine  lens  transformation.  This  is  the  most  general  linezir  camera  model  used  in 
computer  vision,  and  was  adequate  for  representing  the  imaging  transformations  involved 
in  the  Denver  stereo  pair.  Our  Ccimera  model  still  does  not  take  into  account  nonhneau' 
lens  aberrations,  however  no  significant  amount  of  nonline2ir  distortion  was  present  in  these 
images. 

Figure  11  shows  the  results  of  feature  extraction,  model  matching  and  pose  determination 
for  one  of  the  two  stereo  images,  labeled  75nxx.  Figure  11a  shows  a  set  of  line  segments 
extracted  from  a  portion  of  the  image  using  the  Burns  straight  line  extraction  algorithm  [11]. 
The  initial  3D  model  lines  built  by  hcind  are  shown  in  Figure  11b.  The  hybrid  3D-to-2D 
model  matching  algorithm  [9]  was  used  to  automatically  register  these  3D  model  lines  to  the 
set  of  extracted  2D  image  line  segments.  Figures  llc-d  show  ciose-ups  of  the  resulting  model- 
to-image  registration.  Following  model  matching,  an  estimate  of  the  optimal  camera  pose 
for  bringing  the  model  into  registration  with  the  image  data  was  computed  via  a  robust  pose 
determination  algorithm  [25].  This  computed  pose  was  then  comp2ired  to  the  ground  truth 
pose  provided  by  TEC.  For  image  74SXX,  the  difference  between  computed  and  ground  truth 
camera  locations  was  8.73  meters,  while  for  image  75NXX,  the  difference  was  6.91  meters. 
The  average  ground  truth  distance  from  the^czimera  to  the  building  was  831  meters  for  image 
74SXX  and  821  meters  for  image  75NXX;  therefore  the  relative  error  between  estimated  and 
ground  truth  camera  locations  with  respect  to  distance  from  the  modeled  object  was  1.05% 
and  0.84%  respectively.  Angular  deviation  between  computed  and  ground  truth  camera  look 
vectors  was  cilso  computed,  and  found  to  be  0.54  degrees  for  image  74SXX  and  0.41  degrees 
for  image  75NXX.  In  both  cases,  the  agreement  between  estimated  camera  pose  and  ground 
truth  pose  is  remarkably  good,  considering  the  limited  extent  in  the  image  of  the  building 
being  used  in  the  model  matching  and  pose  determination  process. 
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Figure  11:  Model-to-image  feature  registration  for  one  image  of  the  Martin-Marietta  Denver 
me.  (a)  Line  segments  from  a  portion  of  75nxx.  (b)  3D  partial  building  model,  (c)-(d) 
Close-up  of  the  projected  model  lines  overlaid  over  the  2D  image  data  lines  after  model 
registration  and  pose  determination. 
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5  Model  Extension 

Once  a  set  of  images  has  been  registered  using  a  partial  site  model,  the  goal  of  model  exten¬ 
sion  is  to  find  unmodeled  structures  and  add  them  into  the  model.  This  task  breaks  down 
into  three  stages:  detecting  “interesting”  features  in  each  image,  finding  the  correspondence 
of  these  features  across  multiple  images,  and  finally,  triangulating  to  recover  3D  feature 
locations  in  the  local  site  model  coordinate  system. 

The  algorithms  in  this  module  are  also  applicable  to  site  model  acquisition  and  refinement. 
The  process  of  model  refinement  updates  site  feature  locations  from  incoming  model-to- 
image  feature  matches  using  the  same  triangulation  algorithm  that  is  used  during  extension. 
Model  acquisition  is  basiccilly  the  invocation  of  model  extension  procedures  over  the  whole 
image.  The  main  difference  is  that  for  model  extension,  the  pose  of  each  image  can  be 
determined  by  model-to-image  registration  using  the  current  partied  site  model,  whereas  for 
model  acquisition,  the  pose  must  be  supplied  in  some  other  way. 

Figure  12  shows  a  pictorial  sketch  of  model  extension.  Figure  12a  shows  a  partial  site 
model  that  has  been  aligned  to  a  set  of  image  line  segments  using  the  model-to-image  reg¬ 
istration  module.  For  clarity,  only  candidate  data  line  segments  are  shown.  Note  that  one 
building  is  conspicuously  incomplete;  the  go£d  of  this  model  extension  example  is  to  com¬ 
plete  the  building  wireframe.  In  Figure  12b,  four  comer  features  were  selected  (by  hand) 
to  be  added  to  the  model  in  order  to  generate  the  missing  sides  of  the  building  roof.  The 
position  of  these  comer  features  in  the  image,  along  with  the  relative  pose  between  this 
image  and  another  view,  dictates  a  set  of  epipolar  lines  in  the  second  image,  along  which 
the  corresponding  corner  features  must  lie.  These  lines  can  be  truncated  at  maximum  and 
minimum  disparity  bounds  based  on  global  site  model  height  bounds  to  determine  a  set  of 
epipolar  line  segments,  shown  in  Figure  12c.  The  search  for  corresponding  comers  in  the 
second  image  is  carried  out  along  these  bounded  segments.  From  the  corresponding  image 
corners  that  are  found,  the  locations  of  3D  corner  features  are  triangulated,  and  the  missing 
roof  edges  are  then  added  to  the  model.  The  extended  model  can  now  be  used  for  other  site 
model  tasks.  Figure  12d  shows  a  close-up  of  the  new  site  model  registered  to  a  third  image, 
focusing  on  the  line  segments  that  were  added  to  complete  the  partial  building  model.  Note 
that  the  position  and  orientation  of  the  new  lines  coincide  well  with  the  underlying  image. 

The  remainder  of  this  section  describes  algorithms  for  epipolar  feature  matching  and 
triangulation,  and  considers  the  limitations  of  model  extension  based  on  low-level  features 
such  as  points  and  lines.  The  use  of  structured  object  descriptions  is  proposed  as  a  way 
eiround  these  limitations, 

5.1  Epipolar  Matching 

Epipolar  matching  is  easiest  to  describe  for  point  features.  To  find  the  3D  location  in  the 
scene  of  a  selected  image  corner  point  via  triangulation,  the  corresponding  projection  of  that 
corner  feature  in  a  second  image  must  be  located.  From  a  single  image,  aU  that  can  be 
determined  is  that  the  3D  corner  point  lies  along  a  viewing  ray  in  the  scene,  which  can  be 
reconstructed  from  the  2D  image  corner  location  and  the  image  acquisition  parameters.  This 
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Figure  12:  A  pictorial  summary  of  3D  model  extension,  (a)  Partial  site  model  registered  with 
an  image,  (b)  Corner  points  to  be  used  to  complete  one  building  wireframe,  (c)  Epipolar 
me  segments  overlayed  onto  corner  points  in  a  second  image  for  epipolar  feature  matching. 
Matched  corners  are  then  triangulated,  (d)  The  completed  building  model,  a..tomatically 
regis  ere  wi  h  a  third  image  (close-up  of  the  wireframe  edges  that  were  added). 
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ray,  as  seen  from  a  second  camera,  appezurs  as  an  epipolar  line  of  points  in  the  second  image, 
and  the  corresponding  projection  of  the  3D  point  into  the  second  image  must  lie  on  this  line. 
This  is  the  well-known  epipolar  constraint,  a  good  discussion  of  which  appears  in  [5]. 

The  epipolar  constraint  greatly  simplifies  the  search  for  corresponding  features  in  the 
second  image  by  cutting  the  size  of  the  seeirch  space  down  from  the  whole  image  to  a  thin 
region  straddling  the  appropriate  epipolar  line.  One  additionzd  simplification  is  possible 
when  the  maximum  and  minimum  height  of  features  in  the  scene  can  be  roughly  bounded. 
In  this  case,  the  infinite  viewing  ray  of  possible  3D  point  positions  can  be  truncated  at  the 
maximum  and  minimum  Z-coordinate  height  bounds,  and  the  resulting  3D  line  segment 
projected  to  form  a  2D  epipolar  line  segment  in  the  second  image.  This  strategy  cuts  down 
the  size  of  the  search  region  even  more,  leading  to  faster  and  more  reliable  matching  results. 

The  main  problem  in  epipolzir  matching  is  ambiguity.  When  many  corner  features  have 
been  detected  in  the  image,  it  is  highly  likely  that  several  of  them  will  lie  in  the  epipolar 
search  region.  In  this  case,  the  matching  problem  boils  down  to  disambiguating  among  the 
many  potential  matches  to  settle  on  a  single  best  one.  Our  current  implementation  uses  a 
very  simpHstic  strategy,  namely  filtering  corner  features  on  expected  orientation  and  then 
choosing  the  one  that  lies  closest  to  the  epipolar  line.  More  sophisticated  matching  strategies 
are  possible,  the  most  promising  being  the  use  of  a  third  image  for  disambiguation  [5].  The 
idea  is  to  perform  epipolar  matching  from  image  1  to  image  2,  to  get  a  set  of  potential 
matches,  then  from  image  1  to  image  3  to  get  another  set  of  possibilities.  These  sets  of 
potential  matches  eire  then  filtered  for  consistency  with  respect  to  each  other;  each  pair  of 
disparities  (1-2)  and  (1-3)  must  be  consistent  with  coming  from  the  same  3D  point  in  the 
scene.  Many  are  not,  and  are  weeded  out  as  potential  matches.  We  are  currently  evaluating 
this  approach,  and  are  also  generalizing  the  epipolar  matching  system  to  handle  line  segment 
features. 

5.2  Triangulation 

Using  the  computed  model-to-image  feature  matches  determined  via  epipolair  matching, 
multi-image  triangulation  is  performed  to  locate  new  3D  model  points  and  lines  in  the  model 
coordinate  system.  Currently,  only  code  for  triangulation  of  point  features  is  implemented. 
The  estimation  of  new  3D  points  can  be  done  in  either  batch  or  iterative  sequential  mode. 
Triangulation  requires  at  least  two  frames,  therefore  the  minimum  batch  size  is  two.  Results 
from  different  batches  can  be  be  integrated  by  the  standard  Kalman-filter  covariance-based 
updating  equations. 

Due  to  noise  both  in  image  measurements  and  camera  pose  estimates,  image  projection 
rays  will  not  exactly  intersect  at  a  point.  Kumar  has  developed  a  3D  pseudo-intersection 
method  that  minimizes  an  error  equation  based  on  the  same  constraints  that  determine  the 
pose  [25].  The  criterion  underlying  this  error  equation  is  that  the  best  estimate  for  any  model 
point  location  is  the  point  that  minimizes  the  least-squares  distance  between  the  predicted 
image  location  of  the  projected  model  point  and  its  actual  image  lo  ation,  taking  into  account 
covariances  in  the  measured  image  positions  and  the  computed  pose.  Two  non-hnear  error 
equations  are  obtained  for  each  scene  point  for  each  image  frame,  thus  a  minimum  of  two 
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frames  is  needed  to  solve  the  system  of  equations.  Techniques  for  the  solution  of  nonlinear 
^sterns  of  equations  generaUy  requir-e  an  initial  estimate  that  is  close  to  the  true  solution. 
The  initial  estimate  in  this  case  is  chosen  as  the  point  that  minimizes  the  sum  of  squares 
of  perpendicular  distances  to  all  the  image  projection  rays,  a  point  that  is  easily  found 
by  solving  a  Hnear  system  of  equations.  Using  this  initial  guess,  an  iterative  procedure  is 
employed  to  solve  the  system  of  non-linear  equations  for  each  point.  The  iterative  procedure 
IS  repeated  until  there  is  convergence.  Usually  only  one  iteration  is  sufficient  for  accurate 
results  [25].  A  byproduct  of  this  calculation  is  an  approximate  coveiriance  matrix  for  the 
derived  3D  model  point  position. 

The  method  described  above  can  also  be  used  for  model  refinement.  In  this  case  initial 
model  points  have  input  covariances  associated  with  them.  The  pseudo-intersection  method 
IS  used  to  calculate  a  new  estimate  for  each  initial  model  point.  The  covariance  matrices  of 
a  new  estimate  and  an  initial  model  point  are  used  to  fuse  the  two  estimates  and  provide  a 
new  uncertainty  matrix  using  the  standard  Kalman  filtering  equations. 

Kumar  assumes  that  the  lens  parameters  are  known,  so  only  uncertainty  in  the  pose 
parameter  estimates  is  considered  when  computing  the  error  in  a  triangulated  point  position. 
We  are  extending  this  approach  to  handle  uncertainty  in  the  camera  lens  parameters  as  well. 
We  are  also  extending  the  triangulation  equations  to  work  for  lines  as  well  as  points. 


5.3  Structured  Object  Recovery 

Model  extension  work  at  UMass  previous  to  the  RADIUS  project  was  based  on  tracking  point 
and  hne  features  through  a  motion  sequence  of  images  [26].  This  is  clearly  inappropriate  for 
the  present  application  where  images  from  widely  separated  viewpoints  are  used.  Instead, 
an  algorithm  to  perform  feature  matching  along  epipolar  lines  was  programmed  and  applied 
to  the  task  cf  determining  corresponding  corner  features  across  a  pair  of  images.  The  initial 
results  were  not  encouraging:  due  to  the  large  number  of  detected  corner  features  in  each 
image  and  the  imprecision  m  the  computed  epipolar  geometry,  many  ambiguous  potential 
matches  were  identified  for  each  corner.  This  is  a  common  problem,  and  a  number  of  solutions 
have  been  proposed  in  the  literature,  including  filtering  inappropriate  image  matches  based 
on  3D  constraints  and  on  epipolar  lines  from  3  or  more  views. 

Even  if  correct  epipolar  matching  of  individual  point  and  line  features  is  achieved,  sub¬ 
sequent  triangulation  using  these  features  yields  little  more  than  a  cloud  of  3D  corners  and 
lines.  Further  processing  is  still  needed  to  merge  these  primitive  3D  features  into  structured 
objects  of  interest  (such  as  buildings).  Because  of  these  considerations,  we  are  now  exploring 
an  alternative  approach  based  on  more  structured  sets  of  image  features,  such  as  connected 
components  of  lines  and  corners  that  form  a  partied  or  complete  building  wire  frame.  A 
project  is  now  underway  to  perform  perceptual  grouping  of  lines  and  corners  into  poly¬ 
gons  and  wire  frame  structures,  match  them  across  multiple  images  using  the  precomputed 
epipolar  geometry,  then  finally  triangulate  these  structures  using  multiple  images  and  3D 
geometric  object  constraints  such  as  incidence,  parallelism,  perpendicularity  and  coplanarity. 
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6  Image- to-Image  Registration 


Image-toimage  registration  involves  determining  feature  correspondences  between  two  or 
more  images  of  the  same  scene,  along  with  computation  of  the  relative  pose  between  them. 
Clearly,  if  a  model  of  the  scene  is  available,  model-to-image  registration  (Section  4)  could 
be  applied  to  each  image  sepcirately,  and  then  the  relative  pose  between  images  could  be 
computed  from  each  image’s  absolute  pose.  However,  prior  to  initial  model  acquisition,  this 
is  not  an  option.  If  the  pose  for  each  image  is  already  known,  then  features  can  be  matched 
between  images  using  an  epipolar  matching  algorithm  as  described  in  Section  5,  even  when 
no  3D  model  is  available.  But  when  presented  with  a  new  set  of  images  of  an  unmodeled 
site,  neither  of  these  approaches  is  applicable  and  another  more  general  technique  must  be 
used.  Fast  and  reliable  general  image-to-image  matching  techniques  exist  when  the  distance 
between  views  is  small  [2]  or  when  features  can  be  tracked  through  a  motion  sequence  [26]. 
What  is  lacking  are  good  methods  for  finding  matches  in  monocular  images  taken  from 
arbitrary  viewpoints,  as  is  the  case  in  the  RADIUS  domain. 

In  this  section  a  simple  method  is  presented  that  allows  fast  and  accurate  matching  of 
coplanar  structures  across  multiple  images.  We  show  that  the  full  perspective  matching 
problem  for  horizontal  coplanzu-  structures  can  be  reduced  to  a  simpler  four  parameter  affine 
(similarity)  matching  problem  when  the  horizon  line  of  the  scene  can  be  determined  in  the 
image.  Given  the  horizon  line,  the  image  can  be  transformed  to  show  how  the  scene  would 
appear  if  the  camera’s  principle  axis  was  directed  strciight  down,  parallel  to  gravity.  That  is, 
given  the  horizon  line  in  an  aeriail  photograph,  the  image  can  be  aurtificiaUy  warped  to  produce 
a  nadir  view  of  the  same  scene.  This  process  is  called  rectification  in  aeriail  photogrammetry. 

6.1  Nadir  Views 

Nadir  views  are  very  popular  in  aerial  image  understanding  systems  because  they  simplify  the 
description  of  the  imaging  process;  for  example,  flat,  rectangular  building  rooftops  appear  as 
rectangles  in  the  image.  Constraining  the  camera  orientation  to  take  a  nadir  view  fixes  two 
degrees  of  its  rotational  freedom.  The  four  remaining  degrees  of  freedom,  one  free  camera 
rotation  about  the  principle  axis  and  three  translation  parameters,  can  be  chciracterized  by 
how  they  affect  the  appeairance  of  a  horizontal  plane  (such  as  a  building  roof)  in  the  image. 
For  example,  translation  directly  towards  ot^kway  from  the  object  plane  manifests  itself  as 
a  uniform  change  of  scale  in  the  projected  image.  Translation  parallel  to  the  planzir  surface 
shows  up  as  a  proportioned  2D  translation  in  the  image.  Finally,  a  rotation  of  the  camera 
about  its  principle  axis  causes  the  projected  image  to  rotate  by  the  same  angle  about  a  point 
in  the  image  plane.  The  projected  image  of  a  horizontal  plane  in  a  nadir  view  is  therefore 
described  by  four  affine  parameters  that  are  directly  related  to  the  physical  pose  of  the 
camera.  In  other  words,  the  function  that  maps  object  coordinates  to  image  coordinates  for 
a  horizontal  planar  structure  in  a  nadir  view  is  a  four  pcirameter  affine  mapping,  often  called 
a  similarity  mapping. 

For  oblique  views,  the  function  that  maps  horizontal  object  coordinates  into  image  coor¬ 
dinates  is  no  longer  a  similarity  mapping,  but  is  instead,  a  more  general  projective  transfor- 
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mation  [17].  Parallel  roof  Hues  then  appear  to  converge  in  the  image  plane,  intersecting  at 
a  vanishing  point.  Two  or  more  vanishing  points  from  different  sets  of  horizontal,  paraUel 
nes,  orm  a  ne  in  the  image  called  the  vanishing  line  or  horizon  line.  In  nadir  views  all 
onzontals  remain  parailei  in  the  image,  and  the  horizon  line  is  said  to  be 
at  mftnity  .  This  convention  allows  the  relation  between  oblique  and  nadir  views  to  be 
precisely  stated.  Parallel  horizontal  lines  in  the  scene  map  to  Unes  in  the  image  that  are  di¬ 
rected  towards  a  vanishing  point  that  lies  on  the  horizon  line.  Nadir  views  are  distinguished 
from  obhque  views  in  that  the  horizon  line  is  located  at  infinity.  In  this  case,  the  projective 

transformation  mapping  object  coordinates  from  a  horizontal  plane  into  image  coordinates 
IS  a  similarity  transform. 

These  considerations  lead  to  a  simple  yet  powerful  observation.  By  applying  a  projective 
ransformation  that  maps  the  horizon  line  in  the  image  to  the  fine  at  infinity,  the  vanishing 
points  of  ^  horizontal  fines  will  also  appear  at  infinity.  After  this  mapping  has  been  per- 
formed,  aU  parallel  horizontal  fines  in  scene  wifi  appear  paraUel  in  the  image.  This  implies 
that  the  new  image  is  a  nadir  view  of  the  scene,  and  thus,  the  projective  mapping  of  scene 
horizontals  into  the  image  can  be  represented  as  a  similarity  transform. 


6.2  Rectification 

To  take  advantage  of  the  simplicity  of  nadir  views  for  high  altitude  photographs,  we  designed 
a  two  stage  approacn  to  matching  planar  structures  viewed  from  any  orientation  The 
approach  consists  of  applying  information-preserving  transformations  to  the  image  to  reduce 

loLeT^  T  ^  matching  problem.  First,  vanishing  point  analysis  is  pLformed  to 

locate  the  horizon  fine  in  the  image  (see  Section  3),  and  a  projective  transformation  is 
pplied  that  rectifies  the  image.  Second,  the  scene  is  assumed  to  be  approximately  planar 
h  respect  to  camera  altitude,  and  image  features  are  matched  using  a  2D  similarity 
ransformation  version  of  the  local  search  matcher  described  in  Section  4. 

th.^n  ^  orientation  of  the  horizon  fine  determines 

the  3D  orientation  of  the  camera  with  respect  to  gravity.  When  the  equation  of  the  horizon 
fine  is  ai  +  6t/  -h  c  =  0,  the  gravity  vector,  in  camera  coordinates,  is 


n  =  (a, 6,  c)/||(a,6,  c) 


(3) 


For  a  nadir  view,  the  gravity  vector  must  be  parallel  to  the  2-axis  of  the  camera.  If  the 
camera  could  move,  an  oblique  view  could  be  changed  to  a  nadir  view  by  merely  rotating  the 
camera  to  point  straight  down.  The  camera  can  no  longer  be  moved  physically,  of  course 
but  the  image  can  be  transformed  artificaUy  to  achieve  the  desired  3D  rotation. 

Assume  the  unit  orientation  of  the  gravity  vector  has  been  determined  to  be  n,  as  in 
equation  3,  oriented  into  the  image  (c  >  0).  To  bring  this  vector  into  coincidence  with 

n  X  ToTu  5  r'  f  ^  ‘  1)) 

Ihe  effects  of  this  camera  rotation  on  the  image  can  be.  simulated  by  an 
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invertible  projective  transformation  in  the  image  plane  [231.  In  homogeneous  coordinates, 
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The  image  is  transformed  to  appear  as  if  the  camera  had  been  pointing  straight  down, 
parallel  to  gravity.  The  result,  therefore,  is  a  rectified  nadir  view  of  the  scene. 


6.3  Example 

Because  it  rehes  on  2D  image  properties,  image  rectification  is  applicable  when  no  prior 
scene  model  is  available.  It  thus  lends  itself  well  to  the  problem  of  image-to-image  matching 
of  horizontal  coplanar  structures.  In  this  case,  both  images  are  rectified  using  vanishing 
point  information,  and  one  is  treated  as  the  model  while  the  other  becomes  the  data  to  be 
matched.  The  goal  is  then  to  discover  the  similarity  transformation  that  maps  one  set  of 
rectified  image  lines  into  another.  We  use  the  loceJ  seairch  matching  system  described  in 
Section  4.  Although  aerial  images  do  not  often  depict  purely  planar  scenes,  the  method 
can  stiU  be  used  to  get  coarse  matching  results  in  cases  such  as  high-eiltitude  photos,  or  for 
mapping  road  networks  on  a  nezirly  flat  ground  plane. 

Figure  13  shows  an  example  of  rectification-based  aerial  image  registration.  Figures  13a 
and  13b  show  sets  of  streiight  line  segments  extracted  from  two  aerial  photographs  of  Fort 
Hood,  Texas.  The  first  image  presents  a  nadir  view  of  the  scene,  a  fact  verified  by  vanishing 
point  analysis,  which  finds  two  orthogonal  sets  of  nesirly  parallel  lines.  The  second  image  is 
clearly  not  a  nadir  view,  a  fact  again  verified  via  vanishing  point  analysis.  Figure  13c  shows 
these  image  lines  after  applying  the  automated  image  rectification  procedure  described  above. 
To  unwarp  the  second  image  a  generic  camera  model  was  assumed,  placing  the  principle  point 
in  the  center  of  the  image  and  optical  axes  along  the  row  and  column  axes  of  the  raster  image. 
The  focal  length  was  computed  by  finding  the  distance  of  the  focal  point  from  the  image  that 
resulted  in  perpendicularity  of  the  two  vectors  from  the  (variable)  focal  point  towards  the 
two  (fixed)  vanishing  points  in  the  image.'  The  aspect  ratio  was  assumed  to  be  one-to-one. 

To  apply  locail  search  matching,  image  1  was  assumed  to  be  the  model,  and  rectified  lines 
from  image  2,  the  data.  Both  line  sets  were  filtered  to  include  only  lines  greater  than  100 
pixels  long,  reducing  the  matching  problem  to  55  long  lines  in  one  image  and  68  lines  in  the 
other.  Additionally,  the  search  space  was  partitioned  based  upon  the  dominant  orthogonal 
line  directions.  The  best  match  found  is  displayed  in  Figure  13d. 


6.4  Limitations 

Although  automated  image  rectification  based  on  vanishing  point  analysis  works  well  in 
urban  and  industrial  aerial  image  domains,  the  combination  of  rectification  followed  by 
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Figure  13.  Image-to-image  matching  exampi 
view,  (b)  image  lines  from  oblique  view,  (c) 
view  with  rectified  oblique  view. 


e  on  an  aerial  image:  (a)  image  lines  from  nadir 
rectified  oblique  view,  (d)  registration  of  nadir 
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similarity  matching  to  determine  image-to-image  feature  correspondences  is  limited.  The 
approach  works  well  for  relatively  high-altitude  photographs,  like  the  Fort  Hood,  Texas 
images  shown  here,  where  both  ground-level  structures  and  building  rooftops  can  be  treated 
as  nearly  coplanar.  The  approach  does  not  work  well  on  the  model  board  images,  however 
since  these  images  were  taken  at  a  lower  altitude  where  the  3D  structure  of  the’ buildings’ 
becomes  very  apparent.  As  images  in  the  RADIUS  domain  are  expected  to  be  more  like 
the  model  board  images  than  the  Fort  Hood  ones,  the  method  for  image-to-image  matching 
described  in  this  section  is  no  longer  being  pursued. 

However,  to  the  extent  that  some  scene  features  axe  found  to  be  coplanar  and  can  be 
successfully  matched,  this  initial  set  of  planar  correspondences  provides  strong  constreiints 
on  the  positions  of  remaining  features.  For  caUbrated  cameras,  the  relative  rotation  and 
direction  of  translation  between  two  camera  positions  can  be  computed  from  the  perspective 
transformation  describing  how  the  appearance  of  a  planar  structure  differs  in  the  two  images 
[17].  This  reduces  the  search  for  other  3D  feature  correspondences  to  that  of  induced  stereo, 
where  corresponding  feature  points  lie  along  known  epipolar  lines  (see  Section  5). 
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7  Projective  Structure  Recovery 


In  addition  to  traditional  metric  approaches  to  scene  reconstruction  via  pose  determination 
and  triangulation,  we  are  pursuing  a  complementary  research  path  that  investigates  the  use 
of  projective  invariants  for  image  registration,  image  transference,  and  scene  reconstruction. 
The  benefit  of  this  approach  is  that  dependence  on  initial  estimates  of  the  camera  pose 
parameters  are  minimized.  Experimentally,  we  have  noticed  that  techniques  based  on  gen¬ 
eralizing  2D  planar  invariants  to  account  for  non-coplancirity  seem  more  robust  than  those 
based  on  full  3D  to  2D  invariants  (e.g.  methods  based  on  the  essential  matrix).  We  are 
working  under  the  hypothesis  that  this  is  so  because  the  scene  features  projected  onto  an 
aerial  photograph  are  nearly  coplanar  with  respect  to  distance  from  the  camera.  Full  3D  ap¬ 
proaches  work  best  for  sets  of  point  positions  that  vary  significantly  in  all  three  dimensions, 
but  often  fail  when  the  points  are  coplanar.  One  should  not,  therefore,  expect  such  methods 
to  work  well  on  aerial  images. 


7.1  Invariant-Based  Model  Extension 

Previous  work  by  Collins  investigated  invariant  recovery  of  planar  scene  features  [13].  Given 
knowledge  of  at  least  four  coplanar  points  or  lines  in  the  scene  and  their  correspondence  in 
an  image,  a  planar  projective  transformation  (called  a  homography)  that  maps  new  points 
and  lines  from  the  image  into  their  proper  locations  on  the  object  plane  can  be  estimated, 
without  first  computing  pose  or  Ccilibrating  the  camera. 

When  other,  non-coplanar  scene  features  are  present,  the  mapping  from  world  to  image 
is  no  longer  completely  described  by  a  homography,  cind  planar  model  extension  is  no  longer 
cipplicable.  We  have  begun  to  explore  approaches  for  extending  planar  model  extension  to 
handle  more  general  3D  scenes.  Assume  that  two  views  of  a  scene  are  available,  that  at 
least  four  coplanar  points  or  lines  are  available  for  use  as  a  reference  plane,  and  that  an 
image-to-image  homography  has  been  computed  that  transforms  the  projections  of  reference 
plane  point  features  from  image  one  into  their  corresponding  projections  in  image  two. 
What  can  be  said  about  the  transformed  image  of  a  scene  point  outside  of  the  reference 
plane?  The  location  predicted  for  it  by  the  planar  homography  will  not,  in  general,  coincide 
with  the  actual  location  of  the  projected  scene  point  in  image  two.  However,  the  residual 
difference  between  the  predicted  and  actual'^ositions  of  the  projected  point  in  image  two  are 
constrained  to  lie  along  lines  intersecting  in  a  single  point.  These  lines  are  called  epipolar 
lines,  and  they  intersect  at  the  epipole,  which  geometrically  corresponds  to  the  image  of 
the  focal  point  of  the  camera  when  image  one  was  taken.  Furthermore,  the  direction  of 
each  residual  difference  vector,  either  towards  or  away  from  the  epipole,  determines  whether 
the  corresponding  3D  scene  point  lies  either  forwcird  or  behind  the  plane  of  reference.  These 
results  are  valid  regardless  of  either  camera’s  location,  orientation,  or  calibration  parameters. 

For  illustration,  two  aerial  photographs  are  shown  in  Figures  14a  and  14b.  Figures  14c 
and  14d  show  thirty  seven  corresponding  pairs  of  points  that  were  chosen  by  hand  from 
these  images.  Several  sets  of  points  delimit  the  tops  of  buildings,  and  are  therefore  copluiiar 
in  the  scene.  The  four  pairs  of  coplanar  points  marked  with  a  cross  dehneate  the  corners 
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Figure  1.  Two  aerial  photographs  that  are  not  well-approximated  by  a  single  plane  (a) 
RADIUS  Model  Board  1,  Image  J8.  (b)  RADIUS  Model  Board  1,  Image  J2.  (c)  Interesting 
points  extracted  by  hand  from  Image  J8.  (d)  Corresponding  points  extracted  by  hand  from 
mage  J2.  Some  building  boundaries  have  been  added  for  clarity.  Crosses  mark  points  that 
will  be  used  to  estimate  a  homography  between  the  two  images. 
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of  a  rooftop,  and  these  were  used  to  estimate  a  homography  from  the  first  image  into  the 
second.  All  points  from  the  first  image  .were  then  mapped  into  the  second  image  using  this 
homography,  and  their  positions  were  noted.  Figure  15  shows  residual  difference  vectors 
between  predicted  locations  (in  black)  of  transformed  points  from  image  one,  and  actual 
point  locations  (in  white)  where  they  were  found  in  image  two.  The  four  pairs  of  coplanar 
reference  points  align  exactly,  as  they  must  by  definition  of  the  homography. 


Figure  15:  Difference  vectors  between  points  from  image  one  transformed  by  a  planar  ho¬ 
mography  into  image  two  (black  dots),  and  their  corresponding  actual  positions  in  image 
two  (white  dots).  Points  marked  with  crosses  were  used  to  define  the  homography. 

All  remrdning  residual  vectors  lie  along  infinite  epipolar  lines  that  intersect  at  a  single 
epipolar  point,  which  in  this  case,  is  far  off  the  image.  Furthermore,  note  that  the  difference 
vectors  for  structures  taller  than  the  rooftop  used  to  compute  the  homography  are  oriented 
in  one  direction,  while  difference  vectors  for  structures  shorter  than  the  rooftop  are  oriented 
in  the  opposite  direction.  This  property  holJs  in  general,  and  can  be  used  to  qualitatively 
partition  scene  points  into  three  categories  depending  on  the  orientation  of  their  residucils: 
those  lying  closer  to  the  viewer  than  the  reference  plane,  those  lying  on  the  plane  (difference 
vector  is  zero),  and  those  lying  further  away. 


7.2  Approximate  Height  Recovery 

A  more  detailed  analysis  determines  the  extent  to  which  quantitative  information  can  be 
recovered  from  this  geometric  configuration.  This  analysis  is  published  in  [14],  and  will  not 
be  repeated  here.  The  results  of  the  analysis  show  that  when  the  epipole  is  at  infinity  (and 
thus  the  residual  difference  vectors  are  parallel  in  the  image)  the  length  of  each  residual 
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vector  is  directly  proportional  to  the  perpendicular  distance  of  the  corresponding  3D  scene 
point  from  the  reference  plane.  If  the  reference  plane  has  been  chosen  as  some  horizonted 
plane,  such  as  the  top  of  a  building  or  the  ground  plane,  then  the  relative  heights  of  each 
scene  point  can  be  determined.  When.fhe  epipole  is  not  exactly  at  infinity,  but  is  still  far 
away  from  the  center  of  the  image,  then  the  relative  heights  of  scene  points  are  recovered 
approximately.  For  a  horizontal  reference  plcine,  situations  leading  to  nearly  parallel  residuail 
difference  vectors  include  sequences  of  images  taken  from  camera  positions  at  roughly  the 
Scime  zJtitude.  Because  this  method  is  based  on  projective  invariants,  the  exact  altitude, 
orientation,  and  intrinsic  calibration  parameters  of  the  camera  (or  cameras)  do  not  need  to 
be  known. 

An  initial  test  into  the  feasibility  of  this  approach  to  height  recovery  was  conducted  using 
model  board  images  J2  and  J8.  The  four  corners  points  of  building  34  (the  one  with  many 
long,  parallel  roof  vents)  were  identified  in  both  images  and  the  projective  transformation 
relating  the  two  sets  of  points  was  computed.  This  transformation  was  used  to  map  several 
building  corner  points  from  image  J2  into  image  J8,  and  the  lengths  of  the  residucJ  difference 
vectors  between  their  transformed  positions  and  actual  observed  locations  in  image  J8  were 
measured.  As  described  above,  these  lengths  should  be  approximately  proportioned  to  the 
heights  of  each  point  with  respect  to  the  height  of  building  34.  To  compare  the  computed 
relative  heights  to  known  ground  truth  heights  of  each  point,  the  single  scale  factor  relating 
residucd  vector  lengths  to  model  board  Z  coordinate  values  was  estimated,  and  the  resulting 
Z  values  were  then  converted  to  absolute  height  above  a  nominsd  ground  plane  of  Z=-2 
inches.  The  average  relative  error  of  the  computed  heights  of  23  test  points  was  about  4.8%. 
The  results  are  shown  in  Table  2,  where  for  ease  of  interpretation,  adl  model  bocurd  units  (in 
inches)  have  been  converted  into  world  units  (in  feet)  using  the  fact  that  the  model  board  is 
a  1:500  inch  scale  model. 

Table  2;  Comparison  of  ground  truth  point  height  above  the  ground  plane  versus  height 
computed  via  projective  invariants  for  23  selected  test  points  on  model  board  1.  See  text  for 
details.  All  heights  are  reported  in  feet. 
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